# Data Science with Python: Week 5

Alexander L. Hayes

## numpy

* Scientific computing library that provides a powerful array object.
* Building block for virtually all scientific computing and machine learning libraries in Python.
* Provides utilities for interfacing with C/C++ and Fortran code.

In [1]:
import numpy as np

# Create Arrays
a = np.array([1, 2, 3])
b = np.array([2, 3, 5, 7])

# Python Type
print(type(a))

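# Element data type (dtype); the default integer width is platform dependent
print(a.dtype)

# Size (number of elements):  np.size(a) or a.size
print(a.size)

# Shape (length of each dimension):  np.shape(a) or a.shape
print(a.shape)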

# Multiply contents of an array by a constant:
# Multiply all elements in array a by 3.
print(3 * a)

# Adding arrays together:
print(a + (3 * a))

# Apply functions to arrays:
print(np.exp(a))

# We can see the bytes per item:
print(a.itemsize)

# Bytes for a whole array:
print(a.nbytes)

# Set all items in a numpy array to 0:
b.fill(0)
print(b)

# You can do this with slicing as well:
b[:] = 1
print(b)

<class 'numpy.ndarray'>
int64
3
(3,)
[3 6 9]
[ 4  8 12]
[  2.71828183   7.3890561   20.08553692]
8
24
[0 0 0 0]
[1 1 1 1]


In [2]:
# "Where" function
# Flattening arrays
# Reshape arrays (returns a view when possible; the total number of elements must match)
# Reduce
# Outer
# Broadcasting: Operating on arrays where dimensions do not match?
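
The topics listed in the cell above are left as placeholders. A minimal sketch of each follows; the array contents and variable names are made up for illustration and are not from the original notebook.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])

# "Where": pick elementwise between two values based on a condition.
evens_or_zero = np.where(a % 2 == 0, a, 0)

# Reshape: same elements, new shape (2 * 3 == 6 elements).
m = a.reshape(2, 3)

# Flatten: always returns a one-dimensional copy.
flat = m.flatten()

# Reduce: fold a binary ufunc over an axis.
total = np.add.reduce(a)              # equivalent to a.sum()
col_sums = np.add.reduce(m, axis=0)   # one sum per column

# Outer: apply a binary ufunc to every pair of elements.
table = np.multiply.outer(np.array([1, 2, 3]), np.array([1, 2]))

# Broadcasting: shapes (2, 3) and (3,) are compatible, so the
# one-dimensional array is added to every row of m.
shifted = m + np.array([10, 20, 30])

print(evens_or_zero)
print(m)
print(flat)
print(total, col_sums)
print(table)
print(shifted)
```

Broadcasting only works when trailing dimensions are equal or one of them is 1; adding a shape `(2, 3)` array to a shape `(2,)` array raises a `ValueError`.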

## pandas

* http://pandas.pydata.org
* "high-performance, easy-to-use data structures and data analysis tools."
* Capabilities similar to those of SQL.
* Joins and merges, groupby, automatic plotting, multi-level indices, time series operations.

Data structures include "series" (1-dimensional) and "data frames" (2-dimensional).
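
As a rough illustration of the SQL-like operations mentioned above, the sketch below merges two small data frames and then aggregates with `groupby`. The tables, column names, and values are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical tables, invented for illustration.
orders = pd.DataFrame({
    'customer_id': [1, 2, 1, 3],
    'amount': [10.0, 25.0, 5.0, 40.0],
})
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Ada', 'Grace', 'Linus'],
})

# Roughly: SELECT * FROM orders JOIN customers USING (customer_id)
joined = orders.merge(customers, on='customer_id', how='inner')

# Roughly: SELECT name, SUM(amount) FROM joined GROUP BY name
totals = joined.groupby('name')['amount'].sum()

print(joined)
print(totals)
```

Here `joined` is a data frame and `totals` is a series indexed by customer name.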

In [3]:
import pandas as pd

# np.nan marks a missing value (the np.NaN alias was removed in NumPy 2.0)
pd.Series([1, 2, 3, 5, 7, np.nan])

0    1.0
1    2.0
2    3.0
3    5.0
4    7.0
5    NaN
dtype: float64

In [4]:
dates = pd.date_range('20171001', periods=10)

# 10 rows (one per date) by 6 columns of standard-normal random values
df = pd.DataFrame(np.random.randn(10, 6), index=dates, columns=list('abcdef'))

print(df.head(2)) # First two rows
print('...')
print(df.tail(2)) # Last two rows

print('\nPerhaps we are only interested in the columns:\n', df.columns)
print('\nPerhaps we just want the values in the data itself:\n', df.values)

# Sort by index or values:
# missing

                   a         b         c         d         e         f
2017-10-01 -0.485777 -0.809234  0.261354 -1.403313  0.744545 -2.670676
2017-10-02  0.617614  0.301468 -1.495759  0.595042  1.654179 -0.512459
...
                   a         b         c         d         e         f
2017-10-09  0.735101  0.569015  0.484079 -0.044173 -0.383211 -0.660421
2017-10-10  0.399839 -1.071414 -0.090127  1.332709  0.619105 -1.375219

Perhaps we are only interested in the columns:
 Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')

Perhaps we just want the values in the data itself:
 [[-0.48577668 -0.80923436  0.26135356 -1.40331315  0.74454492 -2.67067599]
 [ 0.61761395  0.30146776 -1.49575912  0.59504248  1.65417894 -0.51245857]
 [-0.46160658  0.14182487 -0.25288841 -0.2391479   1.02215822 -1.52768387]
 [ 0.0045966   2.55513003 -1.20052018  0.33880425 -0.51175176 -0.14483781]
 [-1.45706758 -3.45559105 -0.38345109 -1.858941   -0.73022825 -1.60549361]
 [ 1.16085233 -0.232811    0.88262134
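
The cell above leaves sorting as a placeholder. One way it might look, assuming a data frame built the same way as `df` (the values are random, so no output is shown):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20171001', periods=10)
df = pd.DataFrame(np.random.randn(10, 6), index=dates, columns=list('abcdef'))

# Sort rows by the index (here, the dates), newest first.
by_index = df.sort_index(ascending=False)

# Sort rows by the values in a single column, smallest first.
by_value = df.sort_values(by='b')

print(by_index.head(2))
print(by_value.head(2))
```

Both methods return a new data frame and leave `df` unchanged.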

## scikit-learn

* Optimized and easy-to-use implementations of common machine learning algorithms.
* Built on NumPy, SciPy, and matplotlib.
* A good model requires good features: they should correlate strongly with the outcome, be relatively independent of one another, and be on comparable scales. Choosing such features typically requires domain expertise.
* Three basic types of algorithms: classification, regression, and clustering.
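
To make the point about feature scale concrete, here is a minimal sketch that standardizes the Iris features with `StandardScaler`. Using this particular scaler is an assumption for illustration; the notes do not prescribe a scaling method.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# Rescale each feature (column) to zero mean and unit variance.
scaler = StandardScaler()
scaled = scaler.fit_transform(iris.data)

print(iris.data.mean(axis=0))            # original per-feature means differ
print(np.round(scaled.mean(axis=0), 6))  # approximately 0 for every feature
print(np.round(scaled.std(axis=0), 6))   # approximately 1 for every feature
```

When evaluating a model, the scaler would normally be fit on the training split only and then applied to the test split. Tree-based models such as the decision tree used below are largely insensitive to feature scale, so scaling matters more for distance-based or gradient-based methods.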

For demonstration, we will use the classic Iris dataset (included with scikit-learn). It contains 150 samples, each with four features (sepal length, sepal width, petal length, and petal width) and labeled with one of three species.

In [5]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()

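# Species names; the targets themselves are integer codes (0, 1, 2)
labels = iris.target_names

# train_test_split shuffles the samples by default, so no manual shuffle is needed.
# Hold out 30% of the samples for testing.
(train_data, test_data, train_labels, test_labels) = train_test_split(iris.data, iris.target, test_size=0.3)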

model = DecisionTreeClassifier(min_samples_split=7, min_samples_leaf=3)
model.fit(train_data, train_labels)

predicted_labels = model.predict(test_data)
# accuracy_score expects (y_true, y_pred)
print(accuracy_score(test_labels, predicted_labels))

0.888888888889
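
The accuracy above changes from run to run because both the split and the tree use random numbers. A sketch of a reproducible variant follows; the seed value `0` is an arbitrary choice and not part of the original notebook. It also maps the predicted integer codes back to species names.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Fixing random_state makes the split (and the tree) repeatable.
train_data, test_data, train_labels, test_labels = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(
    min_samples_split=7, min_samples_leaf=3, random_state=0)
model.fit(train_data, train_labels)

predicted_labels = model.predict(test_data)
accuracy = accuracy_score(test_labels, predicted_labels)

# Translate integer class codes into species names.
predicted_names = iris.target_names[predicted_labels]

print(accuracy)
print(predicted_names[:5])
```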
