# Python Packages for Machine Learning 

![](https://miro.medium.com/max/633/1*Y2v3PrF1rUQRUHwOcXJznA.png)


## TOC 

1. [Numpy](#numpy)
2. [Matplotlib](#matplotlib)
3. [Pandas](#pandas)
4. [Scikit-learn](#scikit-learn)
5. [Dask](#dask)


<a id='numpy'></a>

# Numpy 

NumPy is short for **Num**erical **Py**thon:

- Important package for Scientific Computing
- N-dimensional array object
- More efficient than built-in Python sequences such as lists for numerical work
- Also has Mathematical Functions
- [Documentation](https://docs.scipy.org/doc/numpy/)

Let's get started by importing the module

In [None]:
# Import NumPy
import numpy as np

## N-Dimensional Arrays 

N-dimensional arrays (`ndarray`), typically just called arrays, are the core data structure in NumPy. Let's get started by creating some ndarrays.

###  1-D Arrays

We can convert a list or other sequence of data to a NumPy array as follows:

In [None]:
test_array = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
test_array

We can check the shape of the array via the following attribute; it returns a tuple describing the shape.

In [None]:
test_array.shape

### 2D Arrays

Arrays can be multidimensional. Below we create an array with 2 rows and 5 columns.

In [None]:
two_dim_test_array = np.array([[1,2,3,4,5],[6,7,8,9,10]])
two_dim_test_array

In [None]:
two_dim_test_array.shape

We can also check out the **ndim** attribute to find the number of dimensions in our array

In [None]:
two_dim_test_array.ndim

### Datatypes in NumPy Arrays

NumPy arrays are homogeneous: every element in an array has the same data type. Available data types include:

- int8, uint8
- int16, uint16
- int32, uint32
- int64, uint64
- float16
- float32
- plus more! 

Below we check the data type of an array and change it:

In [None]:
test_array

In [None]:
test_array.dtype

Data types can also be specified at creation time.

In [None]:
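# 'float128' is not available on every platform (e.g. Windows); np.longdouble is the portable name
test_array_128 = np.array(test_array, dtype=np.longdouble)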

In [None]:
test_array_128.dtype
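An existing array can also be converted to another data type with `astype()`, which returns a new array and leaves the original untouched. A minimal sketch, with illustrative variable names that are not used elsewhere in this notebook:

```python
import numpy as np

float_array = np.array([1.7, 2.2, 3.9])

# astype() returns a converted copy; the original array keeps its dtype
int_array = float_array.astype(np.int32)

print(float_array.dtype)  # float64
print(int_array.dtype)    # int32
print(int_array)          # [1 2 3] (fractional parts are truncated)
```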

### Creating Numpy Arrays with Builtin Functions

NumPy has built-in functions that create arrays automatically.

Below we demo one option, `np.arange`, which works like Python's built-in `range` but returns an array and also accepts non-integer values:

`np.arange(start, stop, step_size)`

In [None]:
test_array = np.arange(1.0, 18., 2.3184)
print(test_array)

test_array.shape
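A few other creation functions are worth knowing. The sketch below shows `np.zeros`, `np.ones`, and `np.linspace`; the variable names are illustrative only.

```python
import numpy as np

zeros = np.zeros((2, 3))        # 2x3 array filled with 0.0
ones = np.ones(4)               # 1-D array of four 1.0 values
evenly = np.linspace(0, 1, 5)   # 5 evenly spaced points, both endpoints included

print(zeros.shape)   # (2, 3)
print(ones)          # [1. 1. 1. 1.]
print(evenly)        # [0.   0.25 0.5  0.75 1.  ]
```

Unlike `np.arange`, which takes a step size, `np.linspace` takes the number of points, which is often more convenient when the endpoints are fixed.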

### Reshaping arrays

We can reshape arrays as follows

In [None]:
print(test_array.reshape(4,2))

In [None]:
print(len(test_array.shape))

Note that `reshape()` does not modify the original array. To keep the reshaped result, assign the returned value to a variable.

In [None]:
test_array = test_array.reshape(4,2)

In [None]:
test_array.shape
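One dimension passed to `reshape()` may be given as `-1`, in which case NumPy infers it from the array size. This is the same `reshape(-1,1)` pattern used in the Scikit-Learn section of this notebook. A small sketch with illustrative names:

```python
import numpy as np

values = np.arange(8)            # shape (8,)

# -1 asks NumPy to infer that dimension from the total number of elements
column = values.reshape(-1, 1)   # one column, as many rows as needed
grid = values.reshape(2, -1)     # two rows, as many columns as needed

print(column.shape)  # (8, 1)
print(grid.shape)    # (2, 4)
```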

### Broadcasting arrays

Below we demo some of the broadcasting rules for NumPy arrays. To get started, let's create a few arrays.

In [None]:
array1=np.array([1,2,3,4,5,6,7,8])
array2=2* np.arange(1,9,1)
array2

#### Combining Two 1D Arrays

In [None]:
array1 + array2

In [None]:
array1 / array2

#### Broadcasting for Scalars and Arrays

In [None]:
2*array1

In [None]:
array1 > 3

FYI, there are some interesting broadcasting rules for multidimensional arrays.  Check out the [docs](https://numpy.org/devdocs/user/basics.broadcasting.html?highlight=shape) for more information. 
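As a small illustration of the multidimensional case, the sketch below adds a 1-D row and a 2-D column to a 2x3 matrix. The arrays are made up for this example.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])        # shape (2, 3)
row = np.array([10, 20, 30])          # shape (3,)
column = np.array([[100], [200]])     # shape (2, 1)

# The row is stretched across both rows of the matrix
print(matrix + row)
# [[11 22 33]
#  [14 25 36]]

# The column is stretched across all three columns
print(matrix + column)
# [[101 102 103]
#  [204 205 206]]
```

Shapes are compared from the trailing dimension backwards; two dimensions are compatible when they are equal or one of them is 1.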

### Slicing Arrays 

Below we demo a few ways to slice arrays. The figure above each code cell visualizes the elements that the cell selects.

In [None]:
array = np.arange(1,8,1)
test_array = np.array([array,3*array,5*array,7*array])
test_array

<img src="img/numpy1.jpg" width=400>

In [None]:
print(test_array[2])

<img src="img/numpy2.jpg" width=400>

In [None]:
print(test_array[1:3])

<img src="img/numpy3.jpg" width=400>

In [None]:
print(test_array[2,5])

<img src="img/numpy4.jpg" width=400>

In [None]:
print(test_array[1:3, 2:5])
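A few more indexing patterns follow the same syntax: selecting a whole column, using a step, and filtering with a boolean mask. The sketch rebuilds the same 4x7 array under different, illustrative names.

```python
import numpy as np

base = np.arange(1, 8, 1)
grid = np.array([base, 3 * base, 5 * base, 7 * base])  # shape (4, 7)

print(grid[:, 0])        # first column: [1 3 5 7]
print(grid[0, ::2])      # every other element of the first row: [1 3 5 7]
print(grid[grid > 40])   # boolean mask keeps elements greater than 40: [42 49]
```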

<a id='matplotlib'></a>

# Matplotlib

Matplotlib offers two interface styles:

- explicit "axes" interface
   * the user creates and controls the figure and axes objects directly
   * harder to use, but gives the user more control
- implicit "pyplot" interface
   * matplotlib infers which figure and axes the user wants
   * easier to use, but gives the user less control
   
Read the [docs](https://matplotlib.org/devdocs/users/explain/api_interfaces.html) for more information
   
## pyplot interface

In [None]:
# For plots inline in notebook
%matplotlib inline 

# Import matplotlib
import matplotlib.pyplot as plt

In the demo below we create several plots with the same data.  Note that in the various demos we just change a few parameters to change things like the color of the graph, scatter versus plot, etc. 

In [None]:
# Setup X and Y values
x = np.linspace(0,25, 20)
y = np.sqrt(x)

# Create plot
plt.xlabel('This is X label')
plt.ylabel('This is Y label')
plt.plot(x,y);

In [None]:
# Blue circles
plt.xlabel('This is X label')
plt.ylabel('This is Y label')
plt.plot(x,y,'o');

In [None]:
# Red circles
plt.xlabel('This is X label')
plt.ylabel('This is Y label')
plt.plot(x,y,'ro');

In [None]:
# Green Lines
plt.xlabel('This is X label')
plt.ylabel('This is Y label')
plt.plot(x,y, 'g-');

In [None]:
# Dashed lines
plt.xlabel('This is X label')
plt.ylabel('This is Y label')
plt.plot(x,y,'--');

In [None]:
# Bar plot
plt.xlabel('This is X label')
plt.ylabel('This is Y label')
plt.bar(x,y);
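The short strings such as `'ro'` and `'g-'` used above are format strings that combine a color, a marker, and a line style. The sketch below draws the same style twice, once with the shorthand and once with explicit keyword arguments; the variable names `short_line` and `long_line` are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 25, 20)
y = np.sqrt(x)

# Format-string shorthand: 'r' = red, 'o' = circle markers, no connecting line
short_line, = plt.plot(x, y, 'ro')

# The same style spelled out with keyword arguments (shifted up by 1 to stay visible)
long_line, = plt.plot(x, y + 1, color='red', marker='o', linestyle='None')

print(short_line.get_marker(), long_line.get_marker())  # o o
```

Keyword arguments are more verbose, but they are easier to read and allow colors and styles that the shorthand cannot express.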

## Axes interface

In the demo below we will create two plots in one figure using the axes interface.  The two plots will show a sine and cosine wave

In [None]:
#get data for plot demo
x=np.arange(0,1,0.025)
y1=10*np.cos(2*np.pi*5*x)
y2=10*np.sin(2*np.pi*5*x)

In [None]:
fig, axs = plt.subplots(2,1)

Above we instantiated the `fig` and `axs` objects used to control a figure. The diagram below highlights what these two objects control:

![](img/axes_matplotlib.png)

`.tight_layout()` is an example of a fig method. 

In [None]:
fig, axs = plt.subplots(2,1)
fig.tight_layout()

Note that since there are multiple subplots in this figure, there are multiple axes objects we can control.

In [None]:
axs.shape,axs

In [None]:
fig, axs = plt.subplots(2,1)
axs[0].plot(x,y1,'r-')
axs[1].plot(x,y2,'k--')
axs[1].set_xlabel('This is X label')
axs[0].set_ylabel('This is Y label 1')
axs[1].set_ylabel('This is Y label 2')
fig.tight_layout();

In [None]:
random_image = np.random.rand(50,50)

In [None]:
fig,ax=plt.subplots()
ax.imshow(random_image);

<a id='pandas'></a>

# Pandas 

- Python package that builds upon numpy arrays
- Useful in data analysis
- Common manipulations of arrays
- [Pandas Documentation](http://pandas.pydata.org/)

Two main data structures in Pandas are:

- Series (1D array; column of a table)
    * Useful for sequence data like time series 
- Dataframes (2D array where each column has its own data type; table of data)
    * Useful for spreadsheet like data

<img src='https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fstorage.googleapis.com%2Flds-media%2Fimages%2Fseries-and-dataframe.width-1200.png&f=1&nofb=' width=600 />

Below we will cover how to create and manipulate both data structures.

## Series 

In [None]:
import pandas as pd
from pandas import Series, DataFrame

We can create a series from a python list with data.

In [None]:
test_series = Series([1, 2, 4, -7, -10 , 20])
test_series

Pandas is built on top of numpy. We can use the `.values` attribute to convert the series into a numpy array.

In [None]:
test_series.values

Unlike arrays, we can customize the index and access elements via the new index.

In [None]:
test_series = Series([1, 2, 4, -7, -10, 20], index=["x", "y", "a", "d", "e", "c"])
test_series

In [None]:
test_series['d']

We can select several elements at once by passing a list of index labels.

In [None]:
test_series[['x', 'd', 'e']]

Scalar multiplication is applied to every element of the series.

In [None]:
test_series*2

We can also create a Series using a dictionary that specifies the index (key).  

In [None]:
test_dictionary = dict({"one":1, "two":2, "three":3, "four":4})
test_series = Series(test_dictionary)

print(test_series)

If we add an index label that has no value associated with it, pandas will add a NaN value.

In [None]:
#data with null/NA values
test_index = ["one", "two", "three", "four", "five"]
test_series = Series(test_dictionary, index=test_index)
print(test_series)

We can find where null values do or do not exist in a Series with `isnull()` and `notnull()`

In [None]:
pd.isnull(test_series)

In [None]:
pd.notnull(test_series)
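Once missing values are located, they are commonly either replaced or dropped. A minimal sketch using `fillna()` and `dropna()` on a small, made-up Series:

```python
import pandas as pd

series = pd.Series({"one": 1, "two": 2, "three": 3},
                   index=["one", "two", "three", "five"])

# "five" has no value in the dictionary, so pandas stores NaN there
print(series.isnull().sum())   # 1

filled = series.fillna(0)      # replace NaN with 0
dropped = series.dropna()      # remove entries that are NaN

print(filled["five"])          # 0.0
print(list(dropped.index))     # ['one', 'two', 'three']
```

Both methods return a new Series and leave the original unchanged.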

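We can add two series together. Values are matched by index label, and any label that is missing (or NaN) in either series produces NaN in the result. Let's try it out.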

In [None]:
test_series

In [None]:
test_series_2 = Series({"one":1, "two":2, "three":3, "five":5})
test_series_2

In [None]:
test_series + test_series_2
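If NaN is not the desired result for labels that appear in only one Series, the `add()` method accepts a `fill_value`. The sketch below uses two small, made-up Series to compare both behaviors.

```python
import pandas as pd

left = pd.Series({"one": 1, "two": 2, "four": 4})
right = pd.Series({"one": 10, "two": 20, "five": 50})

# With the + operator, labels present in only one Series produce NaN
total = left + right

# add() with fill_value treats a missing label as 0 instead
total_filled = left.add(right, fill_value=0)

print(total.isnull().sum())    # 2
print(total_filled["four"])    # 4.0
print(total_filled["five"])    # 50.0
```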

## Dataframes

We can create dataframes from several Python data types. Let's start with data stored in a dictionary where each key is a column name and each value is a list of data.

In [None]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

In [None]:
test_data_frame = DataFrame(data)
test_data_frame

We can change the order of the columns by passing in the order.

In [None]:
# Change column sequence
test_data_frame = DataFrame(data, columns=['year', 'state', 'pop'])
test_data_frame

If we pass a column name that does not exist in our data, that column will be filled with NaN.

In [None]:
test_data_frame = DataFrame(data, columns=['year', 'state', 'pop','debt'])
test_data_frame

We can also specify index, or row names.

In [None]:
# Change row names just like in Series
test_data_frame = DataFrame(data, columns=['year', 'state', 'pop'], 
                            index=['row one', 'row two', 'row three', 'row four', 'row five'])
test_data_frame

We can use a list of column names to select specific columns from a dataframe.

In [None]:
test_data_frame[['year', 'state']]

We can assign values to columns in multiple ways.

In [None]:
test_data_frame['debt'] = 25
test_data_frame

In [None]:
# Assigning a column a vector value
test_data_frame['debt'] = np.arange(1,6,1)
test_data_frame

In [None]:
# pass a series
new_data = Series([1.2, 2.4, 5.7],index=['row two', 'row three', 'row five'])
test_data_frame['debt'] = new_data
test_data_frame

In [None]:
# Reorder rows by passing the index labels in a new order
test_data_frame.reindex(['row five', 'row four', 'row three', 'row two', 'row one'])

We can do mathematical operations like summing rows and columns 

In [None]:
test_data_frame.sum() # sum columns

In [None]:
test_data_frame.sum(axis=1,numeric_only=True) # sum rows, skips over nonnumeric cells

We can also join two dataframes together. 

In [None]:

data_frame_1 = DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 
                          'data1': range(7)})
data_frame_1

In [None]:
data_frame_2 = DataFrame({'rkey': ['a', 'b', 'd'],
                          'data2': range(3)})
data_frame_2

In [None]:
pd.merge(data_frame_1, data_frame_2,left_on='lkey', right_on='rkey')
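By default `pd.merge` performs an inner join, so keys found in only one dataframe are dropped. The `how` parameter selects other join types. The sketch below rebuilds the same two dataframes under illustrative names and compares the row counts.

```python
import pandas as pd

left = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                     'data1': range(7)})
right = pd.DataFrame({'rkey': ['a', 'b', 'd'],
                      'data2': range(3)})

# Default: inner join keeps only keys present in both dataframes ('a' and 'b')
inner = pd.merge(left, right, left_on='lkey', right_on='rkey')

# Left join keeps every row of the left dataframe ('c' gets NaN for data2)
left_join = pd.merge(left, right, left_on='lkey', right_on='rkey', how='left')

# Outer join keeps keys from both sides ('c' and 'd' are both kept)
outer = pd.merge(left, right, left_on='lkey', right_on='rkey', how='outer')

print(len(inner), len(left_join), len(outer))  # 6 7 8
```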

<a id='scikit-learn'></a>

# Scikit-Learn 

Scikit-learn is a machine-learning library for Python. It includes a variety of machine learning algorithms for classification, regression, clustering, etc.

Below we demo some of the basic functionality of scikit-learn.

## Simple Linear Regression

We will start with a demo of building a simple linear regression model.  Let's start by importing the needed libraries from sklearn.

In [None]:
from sklearn.linear_model import LinearRegression

If you get an error message you may need to install sklearn.  You can do that by uncommenting the line below. 

In [None]:
#! pip3 install --user scikit-learn

Next, let's use NumPy to create random data that has a linear relationship where the slope is 0.5 and the y-intercept is 3. Two data sets are built, one for training and one for testing.

In [None]:
slope = 0.5
intercept = 3

X = np.linspace(0,3,30)
y = slope * X + intercept + 0.1*np.random.normal(size=len(X))
X_test = np.linspace(0,3,20)
y_test = slope * X_test + intercept + 0.1*np.random.normal(size=len(X_test))

Below we plot this data:

In [None]:
fig,ax = plt.subplots()
ax.scatter(X,y,c='r',label='train data')
ax.scatter(X_test,y_test,c='b',label='test data')
ax.set_xlabel('Features')
ax.set_ylabel('Target')
ax.legend();

Next we will need to instantiate the object for the model we want to build.  In this first demo, we will use linear regression.

In [None]:
reg = LinearRegression()

Next we fit the model to the training data by calling the `fit()` method and passing in the training data. Scikit-learn expects the features as a 2-D array, so we reshape `X` into a single column.

In [None]:
reg.fit(X.reshape(-1,1),y)

Once our model is fit we can view attributes associated with the linear regression model such as the slope and intercept:

In [None]:
reg.coef_ # slope

In [None]:
reg.intercept_ # intercept 

Finally we can use our model to make predictions by using the `predict()` method.

In [None]:
reg.predict(np.array([[1.5]]))

In [None]:
fig,ax = plt.subplots()
ax.scatter(X,y,c='r',label='train data')
ax.scatter(X_test,y_test,c='b',label='test data')
ax.plot(X,reg.predict(X.reshape(-1,1)),'k-',label='prediction')
ax.set_xlabel('Features')
ax.set_ylabel('Target')
ax.legend();

scikit-learn also has many helpful functions for evaluating models.  Below we demo code for computing the $R^2$ value of our linear regression model. 

In [None]:
from sklearn.metrics import r2_score

yhat = reg.predict(X_test.reshape(-1,1))
print('The R squared value is {}'.format(r2_score(y_test,yhat)))
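For intuition, $R^2$ can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks that formula against `r2_score` on a handful of made-up values; they are not the data generated above.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets and predictions, for illustration only
y_true = np.array([3.0, 3.5, 4.0, 4.5])
y_pred = np.array([3.1, 3.4, 4.2, 4.4])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(r2_manual, r2_score(y_true, y_pred)))  # True
```

A value close to 1 means the predictions explain most of the variance in the targets.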

## Random Forest

The syntax used to build other models is very similar to linear regression.  To demo this, we use the same random data generated above and build a Random Forest model.

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
regr = RandomForestRegressor(max_depth=2, random_state=0)

In [None]:
regr.fit(X.reshape(-1,1),y)

In [None]:
regr.predict(np.array([[1.5]]))

In [None]:
fig,ax = plt.subplots()
ax.scatter(X,y,c='r',label='train data')
ax.scatter(X_test,y_test,c='b',label='test data')
ax.plot(X,reg.predict(X.reshape(-1,1)),'k-',label='LR prediction')
ax.plot(X,regr.predict(X.reshape(-1,1)),'k--',label='RF prediction')
ax.set_xlabel('Features')
ax.set_ylabel('Target')
ax.legend();

In [None]:
yhat = regr.predict(X_test.reshape(-1,1))
print('The R squared value is {}'.format(r2_score(y_test,yhat)))

<a id='dask'></a>

# Dask 

Dask is a Python library for parallel computation. It builds on the existing Python data science ecosystem, including popular libraries such as pandas, NumPy, and scikit-learn. This makes it familiar to Python users while letting them scale their code up to high performance computing (HPC) environments. Dask has high- and low-level interfaces:

- High-level collections: Arrays, Bags, and DataFrames that mimic the Python ecosystem (NumPy, pandas, etc.) and parallelize the code, including for data that doesn't fit in memory.
- Low-level schedulers: task schedulers that allow users to leverage HPC environments.

To learn more about Dask, check out [this dask tutorial](https://github.com/dask/dask-tutorial).