# A basic example
**What is this drink? A Beer or a Wine?**  
**Task**: Create a question answering system (a model) via training  
**Goal**: Create an accurate model that answers our questions correctly most of the time.

In [None]:
# We import libraries that we will use to help to create a model
import matplotlib.pyplot as plt
import pandas
from sklearn import datasets, linear_model
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict

In [None]:
# Defining a function to convert string value of a drink name to numerical value
def drink(value):
    if value == "Beer":
        return 1
    else:
        return 0
def determineDrink(value):
    # Map the model's numeric output back to a drink name
    if value > 0.7:
        return "Beer"
    elif value < 0.3:
        return "Wine"
    else:
        return "Not sure"

Selecting the aspects (features) of data
- the color (as rgb hex representation)
- the alcohol content (as a percentage %)

## First step
![alt text](shop.jpg)

## 1. Gathering the data
The quality and quantity of the data determine how good the predictive model can be.


In [None]:
# Load dataset
orig_dataframe = pandas.read_csv("beer-wine.csv", header=None)
display(orig_dataframe)

## 2. Data Preparation
Randomize data - the order of drinks should not affect the predictions

In [None]:
dataframe = orig_dataframe.sample(frac=1)
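
A shuffle with `sample(frac=1)` gives a different order on every run. The sketch below shows one way to make it repeatable with the `random_state` argument. The `toy` table and its values are made up for illustration and are not the contents of `beer-wine.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the drinks table (values are invented)
toy = pd.DataFrame({
    0: [5.0, 12.5, 4.7, 13.0],
    1: ["B58A00", "7F1500", "C99A00", "6A0F00"],
    2: ["Beer", "Wine", "Beer", "Wine"],
})

# frac=1 returns every row in random order; random_state fixes that order
shuffled_a = toy.sample(frac=1, random_state=42)
shuffled_b = toy.sample(frac=1, random_state=42)

# Same seed, same order; no rows are lost or duplicated
print(shuffled_a.equals(shuffled_b))
print(sorted(shuffled_a.index.tolist()))
```

Fixing the seed is mainly useful while demonstrating or debugging, because the training and test rows then stay the same between runs.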

This is a good time to visualize the data and check for correlations and imbalances.

In [None]:
with plt.xkcd():
    # Based on "The Data So Far" from XKCD by Randall Munroe
    # http://xkcd.com/373/
    fig = plt.figure()
    ax = fig.add_axes((0.2, 0.2, 0.8, 0.7))
    ax.bar([0, 0], [0, orig_dataframe[2].value_counts()["Beer"]], 0.25)
    ax.bar([0, 1], [0, orig_dataframe[2].value_counts()["Wine"]], 0.25)
    ax.bar([2, 2], [0, 50], 0.25)
    ax.spines['right'].set_color('none')
    ax.spines['top'].set_color('none')
    ax.xaxis.set_ticks_position('bottom')
    ax.set_xticks([0, 1, 2])
    ax.set_xlim([-0.5, 3])
    ax.set_ylim([0, 3])
    ax.set_xticklabels(['BEER', 'WINE', "ALL OTHER DRINKS\nTO TRY"])
    plt.yticks([5,10,15,30])
    plt.title("Drinks that we measured")
    
plt.show()
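
Besides the chart, imbalances and correlations can also be checked numerically. This is a minimal sketch on an invented table: the column names `alcohol`, `color` and `is_beer` and all values are assumptions for illustration, not the real dataset.

```python
import pandas as pd

# Hypothetical measurements: alcohol %, color as an integer, label (1 = Beer, 0 = Wine)
toy = pd.DataFrame({
    "alcohol": [5.0, 4.7, 5.3, 12.5, 13.0, 11.5],
    "color": [int(c, 16) for c in ["B58A00", "C99A00", "B58A00", "7F1500", "6A0F00", "7F1500"]],
    "is_beer": [1, 1, 1, 0, 0, 0],
})

# Share of each class: values far from an even split indicate an imbalance
class_share = toy["is_beer"].value_counts(normalize=True)

# Pearson correlation of every column with the label
correlations = toy.corr()["is_beer"]

print(class_share)
print(correlations)
```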

Sometimes the data needs adjusting and manipulation:
- Normalization
- Deduplication
- Error correction
- etc.  

In this case, converting strings into numbers

In [None]:
dataframe[1] = dataframe[1].apply(lambda x: int(x, 16)) # Converts an RGB hex code (e.g. FF0000, without a leading #) to a numerical value
dataframe[2] = dataframe[2].apply(drink) # Converts "Beer" to 1; anything else ("Wine") to 0
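
To make the color conversion concrete, here is a small worked example. The helpers `hex_to_int` and `split_channels` are hypothetical names introduced only for this sketch. The notebook itself keeps the color as a single integer, and splitting it into channels is shown only to explain what that integer encodes.

```python
def hex_to_int(code):
    """Hypothetical helper: parse 'RRGGBB' (with or without a leading '#') as base 16."""
    return int(code.lstrip("#"), 16)


def split_channels(code):
    """Hypothetical helper: return the (red, green, blue) bytes packed in the integer."""
    value = hex_to_int(code)
    return (value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF


print(hex_to_int("7F1500"))      # 8328448
print(split_channels("7F1500"))  # (127, 21, 0)
```

The red channel occupies the highest byte, so it dominates the size of the combined integer.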

In [None]:
display(dataframe)
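
After the conversion the color values are in the millions while the alcohol percentages are in the tens. Normalization, from the list above, brings such columns onto a common scale, which can matter for models that are sensitive to feature scale. A minimal min-max scaling sketch, assuming invented feature rows rather than the real data:

```python
import numpy as np

# Hypothetical rows: [alcohol %, color as integer]
features = np.array([
    [5.0, 0xB58A00],
    [12.5, 0x7F1500],
    [4.7, 0xC99A00],
], dtype=float)

# Min-max scaling: map each column onto the range [0, 1]
col_min = features.min(axis=0)
col_max = features.max(axis=0)
scaled = (features - col_min) / (col_max - col_min)

print(scaled)
```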

Splitting data into two data sets:
- ~70% for model training
- ~30% for testing the model

In [None]:
# Split training data
df_training = dataframe[:16]
dataset = df_training.values
# Split into input (X) and output (Y) variables
train_X = dataset[:,0:2]
train_Y = dataset[:,2]

In [None]:
# Split test data
df_test = dataframe[16:20]
dataset_test = df_test.values
test_X = dataset_test[:,0:2]
test_Y = dataset_test[:,2]
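
The slices above rely on hard-coded row numbers. As an alternative sketch, scikit-learn's `train_test_split` takes the split as a fraction and can shuffle and stratify in the same call. The arrays `X` and `y` below are placeholders, not the drinks data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 20 rows with 2 features each, labels alternating 1 and 0
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.array([1, 0] * 10)

# test_size=0.3 keeps about 30% of the rows for testing;
# stratify=y keeps the Beer/Wine ratio similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
```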

## 3. Choosing a model
Many models have been created over the years for different purposes:
- Decision trees
- Logistic regression
- Neural networks
- k-means
- Linear regression
- etc. etc. etc.  

Linear regression is used to determine the extent to which there is a linear relationship between a dependent variable and one or more independent variables.
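
With two features the model has the form `y = intercept + w1 * x1 + w2 * x2`. The sketch below fits a regression to synthetic data generated from known weights, to show that the fit recovers them. The data and the name `toy_model` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data that follows an exact linear rule: y = 1 + 2*x1 - 3*x2
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

toy_model = LinearRegression().fit(X, y)

# The learned intercept and weights should be close to 1, 2 and -3
print(toy_model.intercept_, toy_model.coef_)
```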

In [None]:
# Linear regression is simple and, with only 2 features, gets the job done
model = linear_model.LinearRegression(copy_X=True,
                                      fit_intercept=True,
                                      n_jobs=5)

## 4. Training the model
- Consumes most of the time in real ML projects
- Bulk of ML process  

Training adjusts the weight given to each feature so that the model's predictions fit the training data.


In [None]:
# Train model
model.fit(train_X, train_Y)

## 5. Model evaluation
Testing the model with data that was set aside (30%) - to check how it will perform with unseen data.

In [None]:
display(test_X)
display(test_Y)

In [None]:
# Make predictions using the testing set
y_pred = model.predict(test_X)

print('R^2 score on the test set: %.2f' % r2_score(test_Y, y_pred)) # coefficient of determination (1.0 is a perfect fit)
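
The R^2 score measures how closely the raw outputs match the 0/1 labels, which is not the same as the share of drinks classified correctly. A sketch of both measures, using invented predictions and an assumed cut-off of 0.5:

```python
import numpy as np
from sklearn.metrics import accuracy_score, r2_score

# Hypothetical true labels (1 = Beer, 0 = Wine) and raw regression outputs
true_labels = np.array([1, 0, 1, 0])
raw_predictions = np.array([0.92, 0.11, 0.58, 0.35])

# Turn the raw outputs into class labels with a 0.5 threshold (an assumption)
predicted_labels = (raw_predictions >= 0.5).astype(int)

print(accuracy_score(true_labels, predicted_labels))  # share of correct labels
print(r2_score(true_labels, raw_predictions))         # fit of the raw outputs
```

Here every label is correct, yet R^2 stays below 1 because the raw outputs are not exactly 0 or 1.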

## 6. Parameter tuning
Adjusting parameters might:
- Increase (decrease) accuracy
- Increase (decrease) model build time  

You decide when the model is good enough!
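
One way to compare parameter settings is cross-validation. The sketch below scores `LinearRegression` with and without an intercept on synthetic data. The data, the choice of `fit_intercept` as the parameter to vary, and `cv=5` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with a clear offset of 5 and a little noise
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(40, 2))
y = 5.0 + 0.5 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(0, 0.1, size=40)

# Mean R^2 over 5 folds for each candidate setting
results = {}
for fit_intercept in (True, False):
    candidate = LinearRegression(fit_intercept=fit_intercept)
    scores = cross_val_score(candidate, X, y, cv=5)
    results[fit_intercept] = scores.mean()

print(results)
```

Because the synthetic data has an offset, the setting with an intercept should score higher; on real data the better setting has to be found by trying.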

## 7. Prediction
The power of machine learning is that we were able to determine how to differentiate between wine and beer using our model rather than using human judgement and manual rules.

In [None]:
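# Feature order: alcohol content (%), then color as an integer; predict() returns an array, so take its first element
print('What drink is it? : ' + determineDrink(model.predict([[11.5, float(int("7F1500", 16))]])[0]))
print('What drink is it? : ' + determineDrink(model.predict([[5.3, float(int("B58A00", 16))]])[0]))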

## 7 steps of the machine learning process:
- Gathering data
- Preparing that data
- Choosing a model
- Training
- Evaluation
- Parameter tuning
- Prediction

