# Summary Steps

## Data Load

* Use pandas to load csv
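
A minimal sketch of the loading step. The CSV content is inlined through `io.StringIO` so the snippet runs on its own; the column names and the sample rows are assumptions chosen to match the iris examples used later in these notes. In practice you would pass a file path to `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical sample standing in for a real file such as "iris.csv"
csv_text = """SepalLengthCm,SepalWidthCm,Species
5.1,3.5,Iris-setosa
7.0,3.2,Iris-versicolor
6.3,3.3,Iris-virginica
"""

# read_csv accepts a path or any file-like object
iris_ds_file = pd.read_csv(io.StringIO(csv_text))

print(iris_ds_file.head())   # first rows
print(iris_ds_file.shape)    # (rows, columns)
```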

## Data analysis

* Charts and plots to see the shape of the data
* Identify missing data and invalid values
* DataFrame description: df.describe()
* Unique values: species = iris_ds_file["Species"].unique()
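
The checks above can be sketched on a small frame. The data and the name `df` are made up for illustration; only the pandas calls come from the list.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing measurement
df = pd.DataFrame({
    "SepalLengthCm": [5.1, 7.0, np.nan, 6.3],
    "Species": ["setosa", "versicolor", "setosa", "virginica"],
})

summary = df.describe()            # count, mean, std, min, quartiles, max of numeric columns
missing = df.isnull().sum()        # number of missing values per column
species = df["Species"].unique()   # distinct values in order of appearance

print(summary)
print(missing)
print(species)
```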

## Preprocessing

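* Replace invalid values with NaN, then count or drop the missing entries
    * df["column"] = df["column"].replace(0, np.nan)
    * df[df == '?'] = np.nan
    * df.isnull().sum() - number of missing values per column
    * df.dropna() - returns a copy without the rows that contain NaN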
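* Imputer
    * sklearn.impute.SimpleImputer (replaces sklearn.preprocessing.Imputer, which was removed in scikit-learn 0.22)
    
    ```python
        import numpy as np
        from sklearn.impute import SimpleImputer

        # Pick one strategy; SimpleImputer always works column by column (no axis argument)
        imp = SimpleImputer(missing_values=np.nan, strategy='mean')
        # imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
        imp.fit(X)
        X = imp.transform(X)
    ```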
* Transform categories into numeric codes, e.g. the target categories in classification problems
    * sklearn.preprocessing.LabelEncoder

    ```python
        from sklearn.preprocessing import LabelEncoder
        species_lbl = LabelEncoder()
        iris_ds_file["Species_code"] = species_lbl.fit_transform(iris_ds_file["Species"])
    ```

* Dummies - for regression problems, where categorical features must be transformed into numeric features

    ```python
        iris_ds_dummies = pd.get_dummies(iris_ds_file, drop_first=True)
        print("Iris Columns: {}".format(iris_ds_dummies.columns))
    ```
* Scaling Data
    ```python
        from sklearn.preprocessing import scale
        X_scale = scale(X)
    ```
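
A minimal end-to-end sketch of the cleaning steps above, assuming a toy frame in which `'?'` and `0` mark missing measurements. The frame, the column names and the choice of `StandardScaler` (the estimator counterpart of `scale`) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical frame: '?' and 0 stand for missing measurements
df = pd.DataFrame({"a": [1.0, "?", 3.0, 5.0], "b": [0, 2, 4, 6]})

df = df.replace("?", np.nan)           # placeholder string -> NaN
df["b"] = df["b"].replace(0, np.nan)   # column-specific invalid value -> NaN
df = df.astype(float)                  # make sure both columns are numeric
print(df.isnull().sum())               # missing values per column

X = SimpleImputer(strategy="mean").fit_transform(df)   # fill NaN with the column mean
X_scaled = StandardScaler().fit_transform(X)           # zero mean, unit variance per column
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

When the data is later split into training and test sets, fitting the imputer and the scaler on the training part only avoids leaking test information into the model.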

## ML Algorithm Selection


## Creating Model

### Model Selection

* sklearn.model_selection.train_test_split
* sklearn.model_selection.GridSearchCV - Few parameters to configure
    ```python
        import numpy as np
        from sklearn.model_selection import GridSearchCV
        from sklearn.neighbors import KNeighborsClassifier
        param_grid = {'n_neighbors': np.arange(1,50)}
        knn = KNeighborsClassifier()
        knn_cv = GridSearchCV(knn, param_grid=param_grid, cv=5)
        knn_cv.fit(X_train, y_train)
        print('Best parameters: {}'.format(knn_cv.best_params_))
        print('Best Score {}'.format(knn_cv.best_score_))
    ```
* sklearn.model_selection.RandomizedSearchCV - Many parameters to configure
    ```python
        # Import necessary modules
        from scipy.stats import randint
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.model_selection import RandomizedSearchCV

        # Setup the parameters and distributions to sample from: param_dist
        param_dist = {"max_depth": [3, None],
                    "max_features": randint(1, 9),
                    "min_samples_leaf": randint(1, 9),
                    "criterion": ["gini", "entropy"]}

        # Instantiate a Decision Tree classifier: tree
        tree = DecisionTreeClassifier()

        # Instantiate the RandomizedSearchCV object: tree_cv
        tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)

        # Fit it to the data
        tree_cv.fit(X, y)

        # Print the tuned parameters and score
        print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
        print("Best score is {}".format(tree_cv.best_score_))
    ```
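
The first item in the list, `train_test_split`, has no example above. A minimal sketch, assuming the iris dataset bundled with scikit-learn; the `stratify=y` argument is an optional addition that keeps the class proportions equal in both splits.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```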

### Models

#### Classification

* sklearn.neighbors.KNeighborsClassifier
* sklearn.tree.DecisionTreeClassifier
* sklearn.svm.SVC
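
All three classifiers share the same `fit` / `score` interface, so they can be compared in a loop. This is a sketch on the bundled iris dataset with default hyperparameters (apart from an illustrative `n_neighbors=5`), not a tuned comparison.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "tree": DecisionTreeClassifier(random_state=42),
    "svc": SVC(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)   # accuracy on the held-out data
    print(name, scores[name])
```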

#### Regression

#### Execution

* ML model execution

    ```python
        from sklearn.neighbors import KNeighborsClassifier

        knn = KNeighborsClassifier(n_neighbors=5)  # k=5 chosen for illustration
        knn.fit(X_train, y_train)
        y_predict = knn.predict(X_test)
    ```

* Pipelines - link steps in the preprocessing and execution steps
    * sklearn.pipeline.Pipeline

    ```python
        import numpy as np
        from sklearn.impute import SimpleImputer
        from sklearn.metrics import classification_report
        from sklearn.model_selection import train_test_split
        from sklearn.pipeline import Pipeline
        from sklearn.svm import SVC

        # Setup the pipeline steps: steps
        steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
                ('SVM', SVC())]

        # Create the pipeline: pipeline
        pipeline = Pipeline(steps)

        # Create training and test sets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

        # Fit the pipeline to the train set
        pipeline.fit(X_train, y_train)

        # Predict the labels of the test set
        y_pred = pipeline.predict(X_test)

        # Compute metrics
        print(classification_report(y_test, y_pred))
    ```

## Evaluating Model

### Metrics

* Accuracy = (tp + tn) / (tp + tn + fp + fn)
* Precision = tp / (tp + fp)
* Recall = tp / (tp + fn)
* F1 Score = 2 * (precision * recall) / (precision + recall)

* High accuracy = most predictions, positive and negative, match the expected value
* High precision = most of the objects we flag as the object of interest really are (few false positives)
* High recall = we find most of the actual objects of interest (few false negatives)
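
A small worked example of the four formulas. The labels are made up so that the counts are easy to verify by hand (tp = 3, fn = 1, fp = 1, tn = 5); the metric functions are those of `sklearn.metrics`.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical binary labels: 1 is the class of interest
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Counts: tp = 3, fn = 1, fp = 1, tn = 5
accuracy = accuracy_score(y_true, y_pred)     # (3 + 5) / 10
precision = precision_score(y_true, y_pred)   # 3 / (3 + 1)
recall = recall_score(y_true, y_pred)         # 3 / (3 + 1)
f1 = f1_score(y_true, y_pred)                 # 2 * (p * r) / (p + r)

print(accuracy, precision, recall, f1)
```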

```python
    accuracy = knn.score(X_test, y_test)
```

### Best Parameter for the model

#### Manual approach

Fit the model for several values of k and compare training and test accuracy to pick the best k

```python
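    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    neighbors = np.arange(1,9)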
    train_accuracy = np.empty(len(neighbors))
    test_accuracy = np.empty(len(neighbors))

    for idx, k_value in enumerate(neighbors):
        knn = KNeighborsClassifier(n_neighbors=k_value)
        knn.fit(X_train, y_train)
        train_accuracy[idx] = knn.score(X_train, y_train)
        test_accuracy[idx] = knn.score(X_test, y_test)

    #Plotting
    plt.title("k-NN: Varying Number of Neighbors")
    plt.plot(neighbors, test_accuracy, label='Test accuracy')
    plt.plot(neighbors, train_accuracy, label='Train accuracy')
    plt.legend()
    plt.xlabel("Number of neighbors")
    plt.ylabel("Accuracy")
    # plt.axes([0.96,1,0,9])
    plt.show()
```

#### Cross Validation

```python
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    # Let's use k=5
    knn = KNeighborsClassifier(n_neighbors=5)
    cv_result = cross_val_score(knn, X_train, y_train, cv=5)
    print("Cross-validation result: {}".format(cv_result))
    print("Cross-validation mean: {}".format(np.mean(cv_result)))
```

#### Confusion Matrix

```python
    from sklearn.metrics import classification_report, confusion_matrix

    print("Confusion Matrix")
    print(confusion_matrix(y_test, y_predict))
    print("Classification Report")
    print(classification_report(y_test, y_predict))
```
Example output:

    Confusion Matrix
    [[10  0  0]
     [ 0 12  1]
     [ 0  0  7]]
    Classification Report
                  precision    recall  f1-score   support

               0       1.00      1.00      1.00        10
               1       1.00      0.92      0.96        13
               2       0.88      1.00      0.93         7

        accuracy                           0.97        30
       macro avg       0.96      0.97      0.96        30
    weighted avg       0.97      0.97      0.97        30
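
The per-class numbers in the report can be recovered from the confusion matrix itself. A short sketch with NumPy, using the matrix printed above and the scikit-learn convention that rows are true classes and columns are predicted classes:

```python
import numpy as np

# Confusion matrix from the output above (rows: true class, columns: predicted class)
cm = np.array([[10, 0, 0],
               [0, 12, 1],
               [0, 0, 7]])

tp = np.diag(cm)                  # correct predictions per class
precision = tp / cm.sum(axis=0)   # column sums: everything predicted as the class
recall = tp / cm.sum(axis=1)      # row sums: everything that truly is the class
accuracy = tp.sum() / cm.sum()    # correct predictions over all samples

print(precision, recall, accuracy)
```

Class 2 has precision 7 / 8 because one sample of class 1 was predicted as class 2; the same sample lowers the recall of class 1 to 12 / 13.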

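#### ROC (binary classification only)

> Plots the true positive rate against the false positive rate as the decision threshold on the predicted probability varies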

```python
    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve
    knn.fit(X_train, y_train)

    # Predicted probability of the positive class (second column of predict_proba)
    y_predict_prob = knn.predict_proba(X_test)[:,1]
    print(type(y_predict_prob))
    print(type(y_test))

    #Generate ROC curve values: fpr, tpr, thresholds
    fpr, tpr, thresholds = roc_curve(y_test, y_predict_prob)

    # Plot ROC curve
    plt.plot([0, 1], [0, 1], 'k--')
    plt.plot(fpr, tpr)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.show()
```

#### Area under the curve (AUC)

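A larger area under the ROC curve indicates a better model: 1.0 is a perfect ranking of positives over negatives, and 0.5 is no better than random guessing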

```python
    from sklearn.metrics import roc_auc_score
    auc = roc_auc_score(y_test, y_predict_prob)
    print("AUC: {}".format(auc))  # the closer to 1, the better
```

##### AUC using Cross-validation

```python
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='roc_auc')
```
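
A self-contained version of the same call, as a sketch. It assumes the breast cancer dataset bundled with scikit-learn, picked only because `scoring='roc_auc'` in this form needs a binary target; the dataset and `n_neighbors=5` are illustrative choices, not part of the notes above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)   # binary target (0 or 1)

knn = KNeighborsClassifier(n_neighbors=5)

# One AUC value per fold; the mean summarizes the model across folds
cv_scores = cross_val_score(knn, X, y, cv=5, scoring="roc_auc")

print("AUC per fold: {}".format(cv_scores))
print("Mean AUC: {}".format(cv_scores.mean()))
```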