Cheatsheet - Data Modelling Process
The following steps provide guidance for completing the exam.

1. Import data - you will need to use the pd.read_csv() function to import the provided data set
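
A minimal sketch of the import step. The inline CSV text and its column names are made up for illustration so the snippet runs on its own; in the exam you would pass the path of the provided file (for example a hypothetical "data.csv") to pd.read_csv() instead.

```python
import io

import pandas as pd

# Stand-in for the provided file (hypothetical columns and values).
# In practice: df = pd.read_csv("data.csv")
csv_text = """passenger_id,age,fare,survived
1,22,7.25,0
2,38,71.28,1
3,,7.92,1
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (rows, columns)
print(df.head())  # first few rows, to confirm the import looks right
```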

2. Dataframe checks - use df.info() and df.describe() to understand your data

    using df.info(), look for the following: 
        missing data
        incorrect data types (e.g., a column looks like a number but the data type in df.info() says it's an object) > this means there are data issues (special characters, text where it shouldn't be, etc.)

    using df.describe(), look for the following:
        do the ranges make sense?
        are there outliers?
        which variables are categorical (even if they are coded as numbers, like 1, 2, 3)
        which variables are continuous
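
The checks above could look roughly like this. The data frame is invented for illustration (the column names "age", "income" and "segment" are assumptions). Using pd.to_numeric() with errors="coerce" is one possible way to locate the values that keep a column from being numeric; it is not the only one.

```python
import pandas as pd

# Hypothetical data: "income" should be numeric but contains a stray "$" and text.
df = pd.DataFrame({
    "age": [25, 32, None, 41, 29],
    "income": ["52000", "61000", "$48000", "unknown", "75000"],
    "segment": [1, 2, 1, 3, 2],  # numeric codes, but really a category
})

df.info()             # data types and non-null counts per column
print(df.describe())  # min/max/quartiles for the numeric columns only

# Missing values per column
print(df.isna().sum())

# Find the entries that stop "income" from being read as a number
as_number = pd.to_numeric(df["income"], errors="coerce")
bad_values = df.loc[as_number.isna() & df["income"].notna(), "income"]
print(bad_values.tolist())

# A numeric column with only a few distinct values is often categorical
print(df["segment"].nunique())
```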

3. Data cleaning

    clean missing values
        check how much of the TOTAL data is missing. If less than 10% of the total data points are missing, drop the rows with missing data
        if 10% or more:
            Compare filling methods: mean/median of the column OR mean/median of the column within each category. Select the method that doesn't cause drastic differences in results
            Explain your choice. Make sure that you have checked both methods before finalizing your choice.
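
A sketch of the missing-value decision, under a few assumptions: the data and the column names "pclass" and "age" are hypothetical, "how much of the data is missing" is read here as the share of rows with at least one missing value, and the median is used as the fill statistic (the mean works the same way).

```python
import pandas as pd

# Hypothetical data with missing ages; "pclass" is the grouping category.
df = pd.DataFrame({
    "pclass": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "age": [40, None, 44, 30, 28, None, 20, None, 22, 24],
})

# Share of rows that have at least one missing value
missing_share = df.isna().any(axis=1).mean()
print(f"rows with missing data: {missing_share:.0%}")

if missing_share < 0.10:
    cleaned = df.dropna()
else:
    # Option A: fill with the overall column median
    fill_overall = df["age"].fillna(df["age"].median())
    # Option B: fill with the median of each pclass group
    group_median = df.groupby("pclass")["age"].transform("median")
    fill_by_group = df["age"].fillna(group_median)

    # Compare both options against the original distribution
    comparison = pd.DataFrame({
        "original": df["age"].describe(),
        "overall_median": fill_overall.describe(),
        "group_median": fill_by_group.describe(),
    })
    print(comparison)
```

Whichever option leaves the summary statistics closest to the original column is usually the easier one to justify.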

4. Data exploration

    Check EACH variable thoroughly
        
        for continuous variables: plot the boxplots with the x-axis as the output variable, and y-axis as the continuous variable
            You are looking for a difference in the spread of the data between output categories. For example, if output category 0 has a spread of 10-50, but output category 1 has a spread of 25-70, the continuous variable is distributed differently depending on the output. This indicates that the variable is meaningful to the analysis.

        for categorical variables: use groupby() to check proportions
            Check whether the proportions of the dependent variable differ across the categories of the independent variable. For example, when comparing Titanic survival by gender, the categorical variable is Gender and we're comparing it to a categorical output "Survived" - if a higher proportion of women survived than the proportion of men who survived, then we can conclude gender has something to do with survival.
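
Both exploration checks can be sketched as follows, using a small made-up Titanic-style table (the values are invented, not real Titanic data). To keep the snippet free of plotting dependencies, it prints the per-group quartiles that a boxplot would draw; the plot itself is left as a comment.

```python
import pandas as pd

# Hypothetical Titanic-style data; "survived" is the output variable.
df = pd.DataFrame({
    "sex": ["female", "female", "female", "female", "male", "male", "male", "male"],
    "age": [22, 35, 28, 40, 30, 45, 52, 38],
    "survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Continuous variable: the numbers a boxplot of age by survived would show.
# With matplotlib installed, df.boxplot(column="age", by="survived") draws it.
age_by_outcome = df.groupby("survived")["age"].describe()
print(age_by_outcome[["min", "25%", "50%", "75%", "max"]])

# Categorical variable: survival rate within each category
survival_rate = df.groupby("sex")["survived"].mean()
print(survival_rate)

# The same comparison as a table of row proportions
print(pd.crosstab(df["sex"], df["survived"], normalize="index"))
```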

5. Feature engineering

    Generate new features from the data you have. This could include:
        binning (i.e., translating a continuous variable into groups like 0-5, 5-10, 10-15, etc.)
        dummy variables (use one-hot encoding, e.g., the pandas get_dummies() function, to convert categorical variables to dummies)
        define new metrics (e.g., multiply columns together or create custom categories based on multiple variables)
    
    Check the relationship of engineered features to the output variable, using the methods outlined above
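
The three kinds of engineered features might be produced like this. All column names ("age", "sibsp", "parch", "embarked") and the bin edges are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical input data
df = pd.DataFrame({
    "age": [4, 12, 18, 7, 15],
    "sibsp": [1, 0, 2, 0, 1],
    "parch": [2, 0, 0, 1, 1],
    "embarked": ["S", "C", "S", "Q", "C"],
})

# Binning: contiguous, non-overlapping groups (each bin includes its upper edge)
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 5, 10, 15, 20],
    labels=["0-5", "5-10", "10-15", "15-20"],
)

# New metric built from several columns
df["family_size"] = df["sibsp"] + df["parch"] + 1

# Dummy variables (one-hot encoding) for a categorical column
df = pd.get_dummies(df, columns=["embarked"], prefix="embarked")

print(df.columns.tolist())
print(df.head())
```

Each new column can then be checked against the output variable with the same groupby or boxplot comparison used for the original variables.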

6. Data model

    Build a baseline set of models using the variables that seemed to have a relationship with the output variable in your exploratory analysis. For a classification problem, build a baseline model for each model type we learned in class > Logistic Regression, Naive Bayes, KNN, SVM, Decision Tree. Use the cross validation function with 10 folds, so each model is fitted and scored 10 times, and calculate the average performance. Remember to use F1 score as the scoring metric in the cross validation function.
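
One way the baseline comparison could be set up with scikit-learn's cross_val_score(). Several things here are assumptions rather than part of the cheatsheet: the synthetic data from make_classification() stands in for your cleaned features, GaussianNB is picked as the Naive Bayes variant, and the scaling step for the distance- and margin-based models is an optional addition.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned feature matrix X and binary output y
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Scaling is an assumption: it helps models sensitive to feature magnitude
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Naive Bayes": GaussianNB(),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    # cv=10 gives ten folds; scoring="f1" reports F1 for the positive class
    scores = cross_val_score(model, X, y, cv=10, scoring="f1")
    results[name] = scores.mean()
    print(f"{name}: mean F1 = {scores.mean():.3f} (std {scores.std():.3f})")

best_name = max(results, key=results.get)
print("Best baseline:", best_name)
```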
    
    Select the model that has the best F1 score overall. Now you will work on optimizing this model.
        Decide whether you want to optimize based on Precision or Recall. You will need to explain your choice in relation to the business objective. 
        Tune the model's hyperparameters. If your computer doesn't have much memory, use RandomizedSearchCV with the n_iter parameter set to a low number like 100
        Check performance metrics: precision, recall, F1 score, roc_auc, accuracy score
        If your model isn't performing well, try adding/removing variables, or engineering more features
    
    Once you're confident you have tried everything possible to build the best possible model, you can finalize your model
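
A hedged sketch of the tuning and evaluation steps above, assuming a Decision Tree won the baseline comparison and that recall is the metric that matters for the business objective. The synthetic data, the parameter ranges, the hold-out split, and n_iter=20 (kept small so the example runs quickly) are all illustrative choices.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Hypothetical search space for the tree's hyperparameters
param_distributions = {
    "max_depth": randint(2, 12),         # integers 2..11
    "min_samples_leaf": randint(1, 20),  # integers 1..19
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=20,         # number of random parameter combinations to try
    scoring="recall",  # swap for "precision" if that fits the objective
    cv=5,
    random_state=0,
)
search.fit(X_train, y_train)

best = search.best_estimator_
pred = best.predict(X_test)
proba = best.predict_proba(X_test)[:, 1]  # probability of the positive class

metrics = {
    "precision": precision_score(y_test, pred),
    "recall": recall_score(y_test, pred),
    "f1": f1_score(y_test, pred),
    "roc_auc": roc_auc_score(y_test, proba),
    "accuracy": accuracy_score(y_test, pred),
}
print(search.best_params_)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```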

7. Explaining your results

    Explain your results by always connecting your decisions to the business problem provided.
    
    Explain your process > how did you decide to clean the data? How do you know that was the best decision? What did you learn from exploratory analysis, and how did this inform your modelling decisions? Which features did you engineer? How did you pick them?
    
    Explain the final model > which variables were included in the model, and what was the model performance?
    
    Explain how this model can be applied to the future to predict outcomes > how will the business use the model? 