# Example: Working with the Tennis data set

Classification with scikit-learn consists of the following high-level stages:

1. Transform the data into NumPy arrays of numbers: a *feature matrix* and a *target vector*.
2. Select an estimator (either a classifier or a regressor).
3. Train the estimator (model) with `model.fit(X_train, y_train)`.
4. Assess the performance of the trained model by comparing the output of `model.predict(X_test)` with the known targets (scikit-learn provides numerous metrics).
5. Deploy the model to predict the target of samples whose target is unknown.
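
The five stages above can be sketched end to end on a tiny data set. This is a minimal illustration, not the Tennis workflow itself: the feature values, the hand-picked train/test split, and names such as `test_idx` and `new_sample` are made up for the example.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Stage 1: a feature matrix (8 samples x 2 features) and a target vector.
# All values are invented for illustration.
X = np.array([[1.0, 2.0], [2.0, 1.0], [1.5, 1.5], [2.0, 2.5],
              [6.0, 7.0], [7.0, 6.0], [6.5, 6.5], [7.0, 7.5]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Hold out one sample per class for testing (a hand-picked split).
test_idx = np.array([3, 7])
train_mask = np.ones(len(y), dtype=bool)
train_mask[test_idx] = False
X_train, y_train = X[train_mask], y[train_mask]
X_test, y_test = X[test_idx], y[test_idx]

# Stage 2: select an estimator.
model = GaussianNB()

# Stage 3: train it.
model.fit(X_train, y_train)

# Stage 4: assess it on samples it has not seen.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("accuracy:", accuracy)

# Stage 5: predict the target of a new sample whose target is unknown.
new_sample = np.array([[6.0, 6.0]])
new_prediction = model.predict(new_sample)
print("prediction:", new_prediction)
```

The two classes here are deliberately far apart, so the sketch only shows the order of the calls, not a realistic accuracy.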

## Data in scikit-learn

In scikit-learn, data is stored as a
**two-dimensional array** of shape `[n_samples, n_features]`.

- **n_samples:**   The number of samples: each sample is an item to process (e.g. classify).
  A sample can be a document, a picture, a sound, a video, an astronomical object,
  a row in a database or CSV file,
  or whatever you can describe with a fixed set of quantitative traits.
- **n_features:**  The number of features or distinct traits that can be used to describe each
  item in a quantitative manner.  Features are generally real-valued, but may be boolean or
  discrete-valued in some cases.
  
Note: the terms *sample*, *observation*, and *instance* are used synonymously in the machine learning literature.
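
As a small illustration of this layout, the sketch below builds a feature matrix and a matching target vector with NumPy and inspects their shapes. The numbers are made up; only the shapes matter.

```python
import numpy as np

# Three samples, each described by four integer-coded features (made-up values).
X = np.array([
    [2, 1, 1, 0],
    [0, 2, 0, 1],
    [1, 0, 1, 1],
])

# One target label per sample, in the same order as the rows of X.
y = np.array([0, 1, 1])

n_samples, n_features = X.shape
print(X.shape)  # (n_samples, n_features)
print(y.shape)  # (n_samples,)
```

Each row of `X` is one sample and each column is one feature, so `X.shape[0]` must equal `len(y)`.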

### Loading the Tennis Data from a file

In [None]:
from pprint import pprint as pp

feature_matrix = []
target_vector = []

# Each line holds whitespace-separated fields: day, f1, f2, f3, f4, target
with open('tennis-data-ints.txt', 'r') as data_file:
    for line in data_file:
        # Starred unpacking: the first field is the day, the last is the target,
        # and everything in between is a feature value.
        day, *features, target = line.strip().split()
        feature_matrix.append([int(f) for f in features])
        target_vector.append(int(target))

pp(feature_matrix)
pp(target_vector)
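
The same parsing pattern can be tried without the data file by reading from an in-memory string. The three lines in `sample_text` are hypothetical and only mimic the assumed layout (a day label, four integer features, an integer target); they are not taken from `tennis-data-ints.txt`.

```python
import io

# Hypothetical contents in the assumed layout: day f1 f2 f3 f4 target
sample_text = """\
D1 2 1 1 0 0
D2 2 1 1 1 0
D3 0 1 1 0 1
"""

feature_rows = []
targets = []

for line in io.StringIO(sample_text):
    fields = line.split()
    if not fields:
        continue  # skip blank lines
    # The first field is the day, the last is the target, the rest are features.
    day, *features, target = fields
    feature_rows.append([int(f) for f in features])
    targets.append(int(target))

print(feature_rows)
print(targets)
```

Because the starred name collects every field between the first and the last, this pattern keeps working if the number of feature columns changes.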

In [None]:
import numpy as np

# Convert the Python lists into NumPy arrays
data = np.array(feature_matrix)
target = np.array(target_vector)

print(data)
print(target)

## Training and Testing

In [None]:
# To evaluate on held-out data instead, split the data first:
# from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.33, random_state=42)

# We will use Gaussian Naive Bayes, a variant of Naive Bayes (NB)
from sklearn.naive_bayes import GaussianNB

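# Fit a Naive Bayes model to the full data set (no train/test split here)
model = GaussianNB()
X_train, y_train = data, target
model.fit(X_train, y_train)

# Make a prediction for a single new sample:
# sunny, cool, high, strong (integer-coded as in the data file)
X_test = np.array([[2, 0, 1, 1]])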

predicted = model.predict(X_test)
predicted
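
Besides the predicted label, `GaussianNB` can also report how confident it is through `predict_proba`, whose columns follow the order of `model.classes_`. The sketch below uses a small made-up stand-in for the Tennis data (`X_demo`, `y_demo`, and `demo_model` are hypothetical names), so the probabilities it prints do not describe the real data set.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up stand-in data: 8 samples, 4 integer-coded features, binary target.
X_demo = np.array([
    [2, 1, 1, 0], [2, 1, 1, 1], [0, 1, 1, 0], [1, 2, 1, 0],
    [1, 0, 0, 0], [1, 0, 0, 1], [0, 0, 0, 1], [2, 2, 1, 0],
])
y_demo = np.array([0, 0, 1, 1, 1, 0, 1, 0])

demo_model = GaussianNB().fit(X_demo, y_demo)

# A single query sample, shaped (1, n_features).
query = np.array([[2, 0, 1, 1]])

proba = demo_model.predict_proba(query)  # one row per sample, one column per class
label = demo_model.predict(query)

print(demo_model.classes_)  # class labels, in column order of proba
print(proba)
print(label)
```

The predicted label is the class with the highest probability in the corresponding row of `proba`.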

### Testing performance on the training data itself

Because the model has already seen these samples, the scores below tend to be optimistic.

In [None]:
y_predicted = model.predict(X_train) 
y_expected = y_train

In [None]:
# Import metrics
from sklearn import metrics

# Summarize the fit of the model
print(metrics.accuracy_score(y_expected, y_predicted))
print()
print(metrics.classification_report(y_expected, y_predicted))
print(metrics.confusion_matrix(y_expected, y_predicted))
print()
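
For comparison, a held-out evaluation along the lines of the commented-out `train_test_split` call might look like the sketch below. It runs on synthetic data generated with a made-up rule rather than on the Tennis file, so the scores it prints say nothing about the real data set; the `test_size` and `random_state` values are the ones from the comment above.

```python
import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data: 30 samples, 4 integer features in {0, 1, 2}.
rng = np.random.default_rng(0)
X_all = rng.integers(0, 3, size=(30, 4))
# Made-up labeling rule, used only so that there is something to learn.
y_all = (X_all[:, 0] + X_all[:, 1] >= 2).astype(int)

# Keep about a third of the samples aside for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.33, random_state=42
)

held_out_model = GaussianNB().fit(X_tr, y_tr)

# Score on the data the model was trained on and on the data it has not seen.
train_acc = metrics.accuracy_score(y_tr, held_out_model.predict(X_tr))
test_acc = metrics.accuracy_score(y_te, held_out_model.predict(X_te))

print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
print(metrics.confusion_matrix(y_te, held_out_model.predict(X_te)))
```

The test accuracy is the more honest estimate of how the model would behave on new samples, since none of the test rows were used in `fit`.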