# Step 4 - Analysis and Reporting

So far, we have covered the steps needed to prepare for building your first Machine Learning model. Now it is time to use the data and build that model.

By the end of this module, you will be able to:

* Prepare your data and train your first Machine Learning model
* Make a first performance evaluation

**NOTE:** This module is intended as an introduction to machine learning modeling, training and evaluation. This process is covered in more detail in the **Machine Learning Methods** courselet.

In [None]:
# We begin by importing the necessary packages
import pandas as pd
import numpy as np
from ucimlrepo import fetch_ucirepo 
from scipy import stats
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score

For this module, we are going to work with the same data as in the previous module, the credit approval data from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/). We start with data preprocessing, without going into much detail, as this step was covered in the previous module.

In [None]:
# load our data https://archive.ics.uci.edu/dataset/27/credit+approval
credit_approval = fetch_ucirepo(id=27) 

# DATA PRE-PROCESSING

# data (as pandas dataframes) 
X = credit_approval.data.features 
y = credit_approval.data.targets

# Drop missing values
X_clean = X.dropna()
floats = X_clean.select_dtypes(include=['float']) # We identify the columns with floating point values

# Handling outliers 
threshold = 3
valid_rows = [] # This list will store one boolean mask per column (True = row is not an outlier)
for x in floats:
    z_score = stats.zscore(X_clean[x])
    valid = abs(z_score) <= threshold
    valid_rows.append(valid)
# Combine the masks: keep a row only if it is not an outlier in any column
combined = pd.concat(valid_rows, axis=1).all(axis=1)
# Remove rows with outliers
X_clean = X_clean[combined]

# Dummies
categorical = X_clean.select_dtypes(include=['object']).columns # First, we identify the non-numerical features
X_dummies = pd.get_dummies(X_clean[categorical], prefix=categorical) # Create dummies
X_clean = pd.concat([X_clean, X_dummies], axis=1) # Concatenate dataframes
X_clean.drop(categorical, axis=1, inplace=True) # Keep only the dummies, as well as the original continuous features

# We need to keep the same indices for y
y_clean = y.loc[X_clean.index]
# Encode the target as binary (1 for "+", 0 otherwise) and flatten it into a 1D array
y_clean = np.where(y_clean == "+", 1, 0).ravel()
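
To see in isolation what the outlier filter and the dummy encoding do, here is a minimal sketch on made-up data. The `toy` dataframe and its column names (`income`, `housing`) are hypothetical and are not part of the credit approval dataset. The z-score is computed directly with pandas using the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`.

```python
import pandas as pd

# Hypothetical toy data: 15 ordinary values, one extreme value, and a categorical column
toy = pd.DataFrame({
    "income": [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0, 9.0,
               10.0, 11.0, 10.0, 9.0, 12.0, 10.0, 11.0, 1000.0],
    "housing": ["own", "rent"] * 8,
})

# Z-score: how many standard deviations each value lies from the column mean
z = (toy["income"] - toy["income"].mean()) / toy["income"].std(ddof=0)

# Boolean mask: True for rows that are NOT outliers
keep = z.abs() <= 3
toy_clean = toy[keep]

# One-hot encode the categorical column; numeric columns are left untouched
toy_encoded = pd.get_dummies(toy_clean, columns=["housing"])

print(len(toy), "rows before,", len(toy_clean), "rows after")
print(list(toy_encoded.columns))
```

Only the row with the extreme value is dropped, and the single `housing` column is replaced by one indicator column per category.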

## Splitting our data

Once our data is clean and ready to be used, the first thing we have to do is separate it into two components: the training and the testing datasets. We use the training set (X_train and y_train) as the input to train a model and fit its parameters. Once we have trained the model, we use it to make predictions on the testing features dataset (X_test) and compare those predictions (y_pred) with the actual values of our target feature (y_test). The Python library [scikit-learn](https://scikit-learn.org/stable/index.html) provides a convenient function for this task.

In [None]:
# Splitting our data - We are using 20% of the data (test_size=0.2) as our testing sample
X_train, X_test, y_train, y_test = train_test_split(X_clean, y_clean, test_size=0.2, random_state=42)
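
As a quick sanity check of what the split produces, the following sketch applies the same call to a small made-up dataset. The arrays `X_toy` and `y_toy` are hypothetical; only `train_test_split` and its `test_size` and `random_state` parameters come from the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features, binary target
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# With 10 samples and test_size=0.2, 8 rows go to training and 2 to testing
print(X_tr.shape, X_te.shape)

# Fixing random_state makes the split reproducible
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te2))
```

If the classes are imbalanced, `train_test_split` also accepts a `stratify` argument that keeps the class proportions similar in both sets.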

## Running a model

Now it is time to run our first ML model, evaluate it and report the results. We are not going to cover many essential details regarding the different types of models and their evaluation metrics. Those topics are covered in a different courselet.

For this task, we are going to run a simple [Perceptron](https://en.wikipedia.org/wiki/Perceptron) model, which is a binary classifier. We are not going to cover the details of this model (that's a topic for a different courselet), but in general terms, this model tries to find a linear function that separates our two classes. We are going to use the predefined [scikit-learn class](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html).

In [None]:
# Building the model
clf = Perceptron()
clf.fit(X_train, y_train) # We train the model using our training data.
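
To make the idea of a "linear function that separates the data" more concrete, here is a minimal sketch on made-up points. The arrays `X_toy` and `y_toy` and the name `toy_clf` are hypothetical. The sketch reads the fitted weights (`coef_`) and bias (`intercept_`) and rebuilds the predictions by hand.

```python
import numpy as np
from sklearn.linear_model import Perceptron

# Hypothetical toy data: two well-separated groups of points
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
                  [3.0, 3.0], [3.0, 4.0], [4.0, 3.0], [4.0, 4.0]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

toy_clf = Perceptron(random_state=0)
toy_clf.fit(X_toy, y_toy)

# The fitted model is a linear function: score = X @ w + b
w = toy_clf.coef_[0]
b = toy_clf.intercept_[0]
scores = X_toy @ w + b

# A positive score is predicted as class 1, otherwise class 0
manual_pred = (scores > 0).astype(int)

print(manual_pred)
print(toy_clf.predict(X_toy))
```

The hand-built predictions match the output of `predict`, which shows that the trained model is nothing more than a weight vector and a bias.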

## Making Predictions

Once our model has been trained, the main goal of our process is to make predictions. We can use the method *predict* to create an array of predictions using the features dataset we reserved for testing.

In [None]:
# Making predictions
y_pred = clf.predict(X_test)

## Evaluating the model

Now that our model is trained and we have made predictions with it, it is time to see how those predictions compare with the actual values of the y_test array. We can use the *accuracy_score* function from scikit-learn to make this evaluation. It divides the number of correct predictions by the size of y_test.
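
The calculation behind accuracy can be sketched by hand on a few made-up labels. The arrays `y_true_toy` and `y_pred_toy` are hypothetical and only serve to show that the manual division and *accuracy_score* agree.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions
y_true_toy = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_toy = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Accuracy = number of correct predictions / total number of predictions
manual_accuracy = (y_true_toy == y_pred_toy).sum() / len(y_true_toy)
sklearn_accuracy = accuracy_score(y_true_toy, y_pred_toy)

print(manual_accuracy)   # 6 correct out of 8 -> 0.75
print(sklearn_accuracy)
```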

In [None]:
# Scoring
accuracy_score(y_test,y_pred)

As we can see, our model did not perform very well. An accuracy of 60% is not much better than random guessing, which on a balanced binary target would give us an accuracy of about 50%. There are certainly other models and techniques that could help us obtain better accuracy. Furthermore, accuracy might not even be our target metric; we may care more about other metrics, such as precision. All of these are topics for a different courselet. What we have tried to accomplish here is to give you the opportunity to run your first model and evaluate it.
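
One way to put an accuracy value in context is to compare it with a trivial baseline and to look at a second metric. The sketch below uses made-up labels and predictions (`y_toy`, `y_pred_toy`), together with scikit-learn's `DummyClassifier` and `precision_score`. It is an illustration under these assumptions, not a result from the credit approval data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score

# Hypothetical imbalanced target: 7 negatives and 3 positives
X_toy = np.zeros((10, 1))  # placeholder features; the baseline ignores them
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_accuracy = accuracy_score(y_toy, baseline.predict(X_toy))

# Hypothetical predictions from some other model
y_pred_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])
model_accuracy = accuracy_score(y_toy, y_pred_toy)

# Precision = true positives / all predicted positives
model_precision = precision_score(y_toy, y_pred_toy)

print(baseline_accuracy, model_accuracy, model_precision)
```

In this toy case the model's accuracy is no better than always predicting the majority class, and only half of its positive predictions are correct. This is why a baseline and a second metric are useful companions to accuracy.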

Now that we have collected data, preprocessed it, trained a model and evaluated it, and assuming we are confident in the results of our current model, what's next? Our very next step is to deploy it and take action.

### Hands-On

In the following cell, try to improve on the previous model by applying some feature selection. Train a new version of the Perceptron model and compare its accuracy with the previous one.

In [None]:
# Your Code HERE