<img src="/Users/miguelfrutossoriano/Desktop/git/Notebook-Competition/assets/miguel_frutos.jpg" width="150" height="200" align="right" />

# DATA SCIENCE NOTEBOOK COMPETITION


<a id="ch2"></a>
# A Data Science Framework

1. **Frame the Problem:** If data science, big data, machine learning, predictive analytics, business intelligence, or any other buzzword is the solution, then what is the problem? As the saying goes, don't put the cart before the horse. Problems before requirements, requirements before solutions, solutions before design, and design before technology. Too often we are quick to jump on the new shiny technology, tool, or algorithm before determining the actual problem we are trying to solve. Look at the big picture and define the problem. The first question to ask your boss is what exactly the business objective is; building a model is probably not the end goal, and you may not even need an ML model to find a solution. How does the company expect to use and benefit from this model? This is important because it will determine how you frame the problem, what algorithms you will select, what performance measure you will use to evaluate your model, and how much effort you should spend tweaking it.

2. **Gather the Data:** John Naisbitt wrote in his 1982 (yes, 1982) book Megatrends that we are "drowning in data, yet starving for knowledge." So, chances are, the dataset(s) already exist somewhere, in some format. It may be external or internal, structured or unstructured, static or streamed, objective or subjective, etc. As the saying goes, you don't have to reinvent the wheel, you just have to know where to find it. In the next step, we will worry about transforming "dirty data" to "clean data."

3. **Prepare Data for Consumption:** This step is often referred to as data wrangling, a required process to turn “wild” data into “manageable” data. Data wrangling includes implementing data architectures for storage and processing, developing data governance standards for quality and control, data extraction (e.g., ETL and web scraping), and data cleaning to identify aberrant, missing, or outlier data points.

4. **Perform Exploratory Analysis:** Anybody who has ever worked with data knows the rule: garbage in, garbage out (GIGO). Therefore, it is important to use descriptive and graphical statistics to look for potential problems, patterns, classifications, correlations, and comparisons in the dataset. In addition, categorizing the data (i.e., qualitative vs. quantitative) is important for understanding it and for selecting the correct hypothesis test or data model.

5. **Model Data:** Like descriptive and inferential statistics, data modeling can either summarize the data or predict future outcomes. Your dataset and expected results will determine the algorithms available for use. It's important to remember that algorithms are tools, not magic wands or silver bullets. You must still be the master crafts(wo)man who knows how to select the right tool for the job. An analogy would be asking someone to hand you a Phillips screwdriver, and they hand you a flathead screwdriver or, worse, a hammer. At best, it shows a complete lack of understanding. At worst, it makes completing the project impossible. The same is true in data modeling. The wrong model can lead to poor performance at best and the wrong conclusion (that’s used as actionable intelligence) at worst.

6. **Validate and Implement Data Model:** After you've trained your model on a subset of your data, it's time to test it. This helps ensure you haven't overfit your model, that is, made it so specific to the selected subset that it does not accurately fit another subset from the same dataset. In this step we determine whether our [model overfits, generalizes, or underfits our dataset](http://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html).

7. **Optimize and Strategize:** This is the "bionic man" step, where you iterate back through the process to make it better...stronger...faster than it was before. As a data scientist, your strategy should be to outsource developer operations and application plumbing, so you have more time to focus on recommendations and design. Once you're able to package your ideas, this becomes your "currency exchange" rate.
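To make step 6 concrete, here is a minimal sketch of hold-out validation. It deliberately uses only the Python standard library so it runs without any installation; the synthetic data, the 80/20 split, and the helper names (`make_data`, `predict_1nn`, `accuracy`) are illustrative assumptions, not part of the framework itself. A 1-nearest-neighbour classifier memorizes its training set, so comparing training accuracy with accuracy on held-out data shows the kind of gap that signals overfitting.

```python
import random


def make_data(n, rng):
    """Two overlapping 1-D classes: class 0 centred at 0.0, class 1 at 1.0."""
    rows = []
    for _ in range(n):
        label = rng.randrange(2)
        rows.append((rng.gauss(float(label), 1.0), label))
    return rows


def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    return min(train, key=lambda row: abs(row[0] - x))[1]


def accuracy(train, rows):
    """Fraction of rows whose label matches the 1-NN prediction."""
    hits = sum(predict_1nn(train, x) == y for x, y in rows)
    return hits / len(rows)


rng = random.Random(0)            # fixed seed so the run is repeatable
data = make_data(200, rng)
rng.shuffle(data)

split = int(0.8 * len(data))      # assumed 80/20 hold-out split
train, test = data[:split], data[split:]

train_acc = accuracy(train, train)  # evaluated on data the model has seen
test_acc = accuracy(train, test)    # evaluated on held-out data

print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

Because every training point is its own nearest neighbour, the training accuracy here is perfect by construction; only the held-out score says anything about how the model generalizes. In the notebook, `train_test_split` and `accuracy_score` from scikit-learn would typically play the roles of the manual split and the `accuracy` helper.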

# Resources
 - [LD Freeman - A Data Science Framework: To Achieve 99% Accuracy](https://www.kaggle.com/code/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy)

# Import libraries

We will need several modules to leverage the ML ecosystem. The cell below lists the libraries we will use. If needed, install them with the following command in your terminal.

```pip install -r /path/to/requirements.txt```
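Before running the long import cell, it can help to check which packages are actually installed. The snippet below is a small sketch using only the standard library; the helper name `find_missing` and the `required` list are assumptions for illustration (the names are the top-level import names used in this notebook, which can differ from the names `pip` installs, for example `sklearn` versus `scikit-learn`).

```python
from importlib.util import find_spec


def find_missing(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if find_spec(name) is None]


# Top-level import names taken from the import cell of this notebook
required = [
    "numpy", "pandas", "scipy", "statsmodels", "sklearn", "sklearn_pandas",
    "tensorflow", "keras", "matplotlib", "seaborn", "pydotplus", "graphviz",
    "missingno", "factor_analyzer", "tqdm", "imblearn",
]

missing = find_missing(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```

`find_spec` only locates a module without importing it, so the check is fast and has no side effects even for heavy packages such as TensorFlow.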

In [None]:
# Basic Libraries
import numpy as np
import pandas as pd

# Math & Stats packages
import math
from math import sqrt
import scipy.stats as stats
from scipy.stats import yeojohnson
from scipy.stats import kruskal
import statsmodels.api as sm
from statsmodels.api import add_constant

# Random package
import random
from random import seed
from random import randrange

# Sklearn: Prime ML library
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KernelDensity
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn import svm, datasets
from sklearn.svm import SVC
from sklearn.datasets import make_blobs
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn_pandas import DataFrameMapper
from sklearn.feature_selection import RFE
from sklearn import linear_model
from sklearn import tree
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.manifold import TSNE
from sklearn.neighbors import DistanceMetric
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, export_graphviz
from sklearn.linear_model import LogisticRegression
from sklearn.exceptions import DataConversionWarning
from sklearn.utils.validation import column_or_1d
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, GradientBoostingClassifier,\
                                    IsolationForest, ExtraTreesRegressor
from sklearn.preprocessing import StandardScaler, RobustScaler, OneHotEncoder,\
                                    LabelEncoder, LabelBinarizer, KBinsDiscretizer
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split,\
                                    GridSearchCV, learning_curve, validation_curve, ShuffleSplit, \
                                    LeaveOneOut, KFold, StratifiedShuffleSplit, StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score, recall_score, matthews_corrcoef, \
                                    mean_squared_error, r2_score, mean_absolute_error, \
                                    roc_auc_score, roc_curve, \
                                    confusion_matrix, multilabel_confusion_matrix, classification_report

# Tensorflow & Keras for DL
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.python.framework import ops
ops.reset_default_graph()
from ann_visualizer.visualize import ann_viz
from keras.models import Sequential
from keras.layers import Dense
from keras import models, layers  
from keras_visualizer import visualizer  


# Visualization
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
from matplotlib import pyplot
from matplotlib import colors
from matplotlib.colors import ListedColormap
import seaborn as sns; sns.set()
import pydotplus
import graphviz
import visualizer

# Get a copy of data
from copy import deepcopy
from copy import copy

# Ignore some warnings
import warnings
warnings.filterwarnings("ignore")  # silences every warning category, DeprecationWarning included

# Missing values
import missingno as msno  # pip install missingno

# Special Dtypes
from collections import defaultdict
from collections import OrderedDict
from collections import Counter

# Factor Analysis
from factor_analyzer import FactorAnalyzer

# Progress Bar
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm.tqdm_notebook is deprecated

# For Markdown visualization
from IPython.display import Image, display, Math, Latex

#Over/Under Sampling
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import NearMiss