Cheatsheet - Data Modelling Process
The following should help provide guidance for completing the exam.

1. Import data - you will need to use the pd.read_csv() function to import the provided data set

2. Dataframe checks - use df.info() and df.describe() to understand your data

    using df.info(), look for the following: 
        missing data
        incorrect data type (e.g., a column looks like a number but the data type in df.info() says it's an object) > this means there are data issues (special characters, text where it shouldn't be, etc.)

    using df.describe(), look for the following:
        do the ranges make sense?
        are there outliers?
        which variables are categorical (even if there is a number indicator, like 1,2,3)
        which variables are continuous
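
A minimal sketch of the checks above. The DataFrame and its column names (`age`, `fare`, `pclass`) are made up for illustration and stand in for whatever pd.read_csv() returns for the provided data set:

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for pd.read_csv("<provided file>")
df = pd.DataFrame({
    "age": [22, 38, np.nan, 35, 54],
    "fare": ["7.25", "71.28", "8.05", "$53.10", "51.86"],  # stray "$" keeps this column as text
    "pclass": [3, 1, 3, 1, 1],  # numeric codes, but really a category
})

df.info()             # non-null counts and data type per column
print(df.describe())  # ranges and quartiles for the numeric columns

# Missing values per column
missing_counts = df.isna().sum()

# Columns that are not stored as numbers (candidates for data issues)
non_numeric = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]

# Rows where "fare" cannot be parsed as a number, i.e. the cause of the text data type
bad_fare = df[pd.to_numeric(df["fare"], errors="coerce").isna()]
print(missing_counts, non_numeric, bad_fare, sep="\n")
```

Printing the unparseable rows is usually the quickest way to see which special characters or text values need cleaning.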

3. Data cleaning

    clean missing values
        check how much of the TOTAL data is missing. If less than 10% of the total data points are missing, then drop the rows with missing data
        if more than 10%:
            Compare filling methods: Average/Median of the column OR Average/median of the column based on a category. Select the method that doesn't cause drastic differences in results
            Explain your choice. Make sure that you have checked both methods before finalizing your choice. 
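
One possible way to run this comparison, sketched on invented data. The columns `pclass` and `age` are assumptions, and comparing describe() output is only one way to judge "drastic differences":

```python
import numpy as np
import pandas as pd

# Hypothetical data: "age" has gaps, "pclass" is a category to group by
df = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "age": [40.0, np.nan, 50.0, 20.0, 22.0, np.nan],
})

# Share of ALL data points (cells) that are missing
missing_share = df.isna().sum().sum() / df.size

# Option A: drop rows with missing data (when the share is small)
dropped = df.dropna()

# Option B1: fill with the overall column median
fill_overall = df["age"].fillna(df["age"].median())

# Option B2: fill with the median of the row's own category
fill_by_group = df["age"].fillna(df.groupby("pclass")["age"].transform("median"))

# Compare how each method shifts the column's summary statistics
comparison = pd.DataFrame({
    "original": df["age"].describe(),
    "overall_median": fill_overall.describe(),
    "group_median": fill_by_group.describe(),
})
print(f"missing share: {missing_share:.1%}")
print(comparison)
```

Swapping "median" for "mean" in both fills gives the average-based versions of the same comparison.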

4. Data exploration

    Check EACH variable thoroughly
        
        for continuous variables: plot the boxplots with the x-axis as the output variable, and y-axis as the continuous variable
            You are looking for a difference in the spread of the data across output categories. For example, if output category 0 has a spread of 10-50, but output category 1 has a spread of 25-70, the distribution of the continuous variable differs by output. This would indicate that the variable is meaningful to the analysis.

        for categorical variables: use groupby() to check proportions
            Check whether the categories of the independent variable have different proportions of the dependent variable. For example, when comparing Titanic survival by gender, the categorical variable is Gender and we're comparing it to a categorical output "Survived" - if we see that a higher proportion of women survived than the proportion of men who survived, then we can conclude gender has something to do with survival.
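
Both checks can be sketched numerically as below. The Titanic-style data and column names are invented. The per-class quartiles are the numbers a boxplot draws, so the table reads the same way as the plot:

```python
import pandas as pd

# Hypothetical Titanic-style data; column names are assumptions
df = pd.DataFrame({
    "survived": [0, 0, 0, 0, 1, 1, 1, 1],
    "sex": ["male", "male", "male", "female", "female", "female", "female", "male"],
    "age": [30, 45, 50, 28, 22, 35, 27, 40],
})

# Continuous variable: quartiles per output class (what a boxplot shows).
# df.boxplot(column="age", by="survived") draws the same comparison (needs matplotlib).
age_spread = df.groupby("survived")["age"].describe()[["min", "25%", "50%", "75%", "max"]]

# Categorical variable: survival rate within each category
survival_rate = df.groupby("sex")["survived"].mean()

# The same idea as a full table; each row sums to 1
proportions = pd.crosstab(df["sex"], df["survived"], normalize="index")

print(age_spread)
print(survival_rate)
print(proportions)
```

A clear gap between the rows of `survival_rate`, or between the quartiles in `age_spread`, is the signal to look for.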

5. Feature engineering

    Generate new features from the data you have. This could include:
        binning (e.g., translate a continuous variable into groups like 0-10, 11-20, etc.)
        dummy variables (use one hot encoding, or pandas get_dummies() function to convert categorical variables to dummies)
        define new metrics (e.g., multiply columns together or create custom categories based on multiple variables)
    
    Check the relationship of engineered features to the output variable, using the methods outlined above
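
A short sketch of the three techniques, using pd.cut() and pd.get_dummies(). The columns, bin edges, and the `family_size` metric are hypothetical choices for illustration:

```python
import pandas as pd

# Hypothetical data; column names are assumptions
df = pd.DataFrame({
    "age": [4, 17, 35, 62],
    "sex": ["male", "female", "female", "male"],
    "sibsp": [1, 0, 1, 0],   # siblings/spouses travelling together
    "parch": [2, 0, 0, 0],   # parents/children travelling together
})

# Binning: contiguous, non-overlapping groups (each bin includes its right edge)
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 12, 18, 60, 120],
    labels=["child", "teen", "adult", "senior"],
)

# New metric built from several columns
df["family_size"] = df["sibsp"] + df["parch"] + 1

# Dummy variables (one hot encoding) for the categorical columns
encoded = pd.get_dummies(df, columns=["sex", "age_group"])
print(encoded.head())
```

Each engineered column can then be checked against the output variable with the same groupby or spread comparisons used during exploration.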

6. Data model

    Build a baseline set of models using the variables from your exploratory analysis that seem to have a relationship with the output variable. For a classification problem, build a baseline model from each model we learned in class > Logistic Regression, Naive Bayes, KNN, SVM, Decision Tree. Use the cross validation function with 10 folds, so that each model is trained and scored 10 times, and calculate an average performance. Remember to use F1 score in the cross validation function.
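
A rough sketch of the baseline comparison with scikit-learn's cross_val_score(). The synthetic data from make_classification() stands in for your cleaned features, and wrapping the scale-sensitive models in a StandardScaler pipeline is an added assumption, not something the steps above require:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned feature matrix X and binary target y
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Naive Bayes": GaussianNB(),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# 10-fold cross validation scored with F1; keep the mean per model
baseline_f1 = {
    name: cross_val_score(model, X, y, cv=10, scoring="f1").mean()
    for name, model in models.items()
}

for name, score in sorted(baseline_f1.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")

best_name = max(baseline_f1, key=baseline_f1.get)
```

Note that scoring="f1" assumes a binary target; a multi-class target would need a variant such as "f1_macro".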
    
    Select the model that has the best F1 score overall. Now you will work on optimizing this model.
        Decide whether you want to optimize based on Precision or Recall. You will need to explain your choice in relation to the business objective. 
        Tune the model using hyperparameter tuning. If your computer doesn't have much memory, use RandomizedSearchCV with the n_iter parameter set to a low number like 100
        Check performance metrics: precision, recall, F1 score, roc_auc, accuracy score
        If your model isn't performing well, try adding/removing variables, or engineering more features
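
The tuning and metric checks might look roughly like this for a decision tree. The search space, the choice of "recall" as the scoring target, the train/test split, and the small n_iter are all illustrative assumptions:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned data
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Hypothetical search space for a decision tree
param_distributions = {
    "max_depth": randint(2, 12),
    "min_samples_leaf": randint(1, 20),
    "criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=20,          # number of sampled combinations; lower runs faster
    scoring="recall",   # or "precision", depending on the business objective
    cv=5,
    random_state=0,
)
search.fit(X_train, y_train)

best_model = search.best_estimator_
y_pred = best_model.predict(X_test)
y_score = best_model.predict_proba(X_test)[:, 1]  # probability of class 1, for ROC AUC

metrics = {
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_score),
    "accuracy": accuracy_score(y_test, y_pred),
}
print(search.best_params_)
print(metrics)
```

Reporting all five metrics on held-out data makes it easier to see what optimizing for one of them cost in the others.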
    
    Once you're confident you have tried everything possible to build the best possible model, you can finalize your model

7. Explaining your results

    Explain your results by always connecting your decisions to the business problem provided.
    
    Explain your process > How did you decide to clean the data? How do you know that was the best decision? What did you learn from exploratory analysis, and how did this inform your modelling decisions? Which features did you engineer? How did you pick these?
    
    Explain the final model > which variables were included in the model, and what was the model performance?
    
    Explain how this model can be applied to the future to predict outcomes > how will the business use the model? 

Precision, Recall, Accuracy, and F1 are metrics commonly used in evaluating the performance of classification models. They each provide different insights into the model's performance, and the choice of which metric to use depends on the specific goals and characteristics of your application.

1. *Precision*: Precision is the ratio of true positive predictions to the total number of positive predictions made by the model. It focuses on how many of the predicted positive cases are actually true positives. Precision is important when the cost of false positives is high. For example, in medical diagnoses, you want to be very sure that a positive prediction is indeed correct to avoid unnecessary treatments or interventions.

   Precision = TP / (TP + FP)

2. *Recall (Sensitivity)*: Recall is the ratio of true positive predictions to the total number of actual positive cases. It measures how well the model identifies all positive cases, emphasizing the avoidance of false negatives. Recall is crucial when the cost of false negatives is high. For instance, in a spam email classifier, you want to ensure that all actual spam emails are identified, even if it means some false positives.

   Recall = TP / (TP + FN)

3. *Accuracy*: Accuracy is the ratio of correct predictions (both true positives and true negatives) to the total number of predictions. It gives an overall measure of how well the model performs across all classes. However, accuracy can be misleading, especially when dealing with imbalanced datasets where one class dominates the others.

   Accuracy = (TP + TN) / (TP + TN + FP + FN)

4. *F1 Score*: The F1 score is the harmonic mean of precision and recall. It's a balanced metric that takes both false positives and false negatives into account. The F1 score is most useful when you need a balance between precision and recall and can't favor one over the other, which is often the case with imbalanced datasets.

   F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

In summary:
- Use *Precision* when false positives are more critical.
- Use *Recall* when false negatives are more critical.
- Use *Accuracy* as a general metric when class distribution is balanced.
- Use *F1 Score* when you want a balance between precision and recall, especially in imbalanced datasets.

It's important to choose the metric that aligns with your specific goals and the context of your problem. In some cases, you might need to use a combination of these metrics and consider the trade-offs between them.
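
As a small worked example of the four formulas, the sketch below derives TP, FP, FN and TN from a confusion matrix on ten invented labels and checks the hand-computed values against scikit-learn's own metric functions:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Invented labels: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary 0/1 labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f} f1={f1:.2f}")

# The library functions should agree with the hand-computed values
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred),
      accuracy_score(y_true, y_pred), f1_score(y_true, y_pred))
```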


Receiver Operating Characteristic (ROC) curves and Precision-Recall (PR) curves are graphical tools used to evaluate the performance of binary classification models, particularly in cases where you want to understand how the model's prediction threshold affects its performance. They are most useful when dealing with imbalanced datasets or when you need to make decisions about the trade-offs between true positives, false positives, true negatives, and false negatives.

Here's when to use each curve:

1. *ROC Curve*:
   - Use ROC curves when you want to evaluate a model's performance across various thresholds for classifying positive and negative instances.
   - ROC curves are well-suited for cases where class distribution is roughly balanced.
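   - ROC curves plot Sensitivity (Recall, the true positive rate) against the false positive rate (1 - Specificity), so they visualize the trade-off between Sensitivity and Specificity (True Negative Rate).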
   - ROC curves can help you identify an optimal threshold that balances true positive rate and false positive rate based on your specific needs.
   - The area under the ROC curve (AUC-ROC) is often used as a summary metric of the model's discriminatory power. A higher AUC-ROC indicates better model performance.


2. *Precision-Recall Curve*:
   - Use Precision-Recall curves when dealing with imbalanced datasets, where the number of negative instances significantly outweighs the positive instances.
   - Precision-Recall curves focus on the trade-off between Precision and Recall at different thresholds.
   - Precision-Recall curves are particularly informative when you want to emphasize the performance on positive instances (e.g., rare events, anomalies, positive diagnoses) rather than negative instances.
   - Precision-Recall curves are often used in scenarios where the cost of false positives is high, and you want to ensure that positive predictions are highly reliable.
   - The area under the Precision-Recall curve (AUC-PR) provides a single value that summarizes the model's ability to balance precision and recall. Higher AUC-PR values indicate better model performance.

In summary:
- Use *ROC curves* when you want to explore the trade-off between true positive rate and false positive rate across various classification thresholds. They are well-suited for balanced datasets.
- Use *Precision-Recall curves* when you want to understand the trade-off between precision and recall, especially in cases with imbalanced datasets or when the cost of false positives is high.

Both ROC curves and Precision-Recall curves provide valuable insights into model performance, and the choice between them depends on your specific problem, class distribution, and the importance of different types of errors.
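
Both curves can be computed from predicted probabilities, as in the sketch below. The imbalanced synthetic data and the logistic regression model are placeholders, and average_precision_score() is used here as one common summary of the area under the Precision-Recall curve:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 10% positives
X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# ROC curve: false positive rate vs. true positive rate at each threshold
fpr, tpr, roc_thresholds = roc_curve(y_test, y_score)

# Precision-Recall curve: precision vs. recall at each threshold
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_score)

auc_roc = roc_auc_score(y_test, y_score)
auc_pr = average_precision_score(y_test, y_score)
print(f"AUC-ROC={auc_roc:.3f}  AUC-PR (average precision)={auc_pr:.3f}")
```

Plotting `fpr` against `tpr`, and `recall` against `precision`, gives the two curves. On imbalanced data, AUC-PR is often noticeably lower than AUC-ROC for the same model.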