
# Phase 3 Project Description

Congratulations! You've made it through another _intense_ module, and now you're ready to show off your newfound Machine Learning skills.

All that remains in Phase 3 is to put your new skills to use with another large project.

In this project description, we will cover:

* Project Overview
* Deliverables
* Grading
* Getting Started

## Project Overview

You'll work through the entire data science process to solve a **classification** problem using a **dataset you choose**.

### Business Problem and Data

You need to:
1. Define a stakeholder (who cares about the problem).
2. Identify a business problem.
3. Choose a dataset.

For more details, check [Phase 3 Project - Choosing a Dataset](https://github.com/learn-co-curriculum/dsc-phase-3-choosing-a-dataset).

### Key Points

#### Classification

Remember the difference between *classification* and *regression*:

- **Classification**: Target variable is a *category* (e.g., Yes/No, Red/Blue).
- **Regression**: Target variable is a *numeric value* (e.g., price, height).

For this project, you must work on a **classification problem**.
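As a quick sanity check on a candidate dataset, the sketch below uses scikit-learn's `type_of_target` helper to inspect what kind of target a column holds. The example values are made up for illustration. Treat this as a heuristic only: an integer-coded numeric target (such as a count) would also be reported as a class label.

```python
from sklearn.utils.multiclass import type_of_target

# Hypothetical target columns, invented for this illustration.
churned = ["yes", "no", "yes", "no"]      # two categories
color = ["red", "blue", "green", "red"]   # more than two categories
price = [199.99, 250.5, 180.0, 320.75]    # continuous numeric values

churned_kind = type_of_target(churned)
color_kind = type_of_target(color)
price_kind = type_of_target(price)

# "binary" and "multiclass" targets suit classification;
# "continuous" indicates a regression problem instead.
for name, kind in [("churned", churned_kind), ("color", color_kind), ("price", price_kind)]:
    print(f"{name}: {kind}")
```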

#### Findings and Recommendations

In previous projects, you focused on describing data and drawing inferences from it (descriptive and inferential analysis). Now, you should also take a ***predictive*** approach.

Predictive *findings* might include:
- How well your model predicts the target.
- Important features for your model.

Predictive *recommendations* might include:
- When your model's predictions are useful.
- How the business can change inputs to get desired results.

#### Iterative Approach to Modeling

You need to build multiple models. Start with a basic model, evaluate it, and then improve it. Discuss your final model in 1-3 paragraphs.

Explore:
1. Model features and preprocessing.
2. Different models (e.g., logistic regression, decision trees).
3. Different model settings (hyperparameters).

You must build at least two models:
- A simple baseline model (e.g., logistic regression).
- An improved version with tuned settings.
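The baseline-then-tune workflow above could look roughly like the following. This is a minimal sketch, not a required approach: it assumes synthetic data from `make_classification` in place of your chosen dataset, and the grid of `C` values is an arbitrary starting point.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Step 1: a simple baseline with default hyperparameters.
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# Step 2: tune the regularization strength C with cross-validation,
# using only the training data to choose the best value.
param_grid = {"C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)
tuned_acc = search.score(X_test, y_test)

print(f"Baseline test accuracy: {baseline_acc:.3f}")
print(f"Tuned test accuracy:    {tuned_acc:.3f} with {search.best_params_}")
```

The tuned model is not guaranteed to beat the baseline. If it does not, that is still a finding worth reporting when you explain your iterations.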

#### Classification Metrics

Choose the right metrics to evaluate your models. Use these metrics to assess performance on both training and testing data.
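One way to compare training and testing performance is sketched below. The `summarize` helper is a hypothetical name introduced here, the data is synthetic, and the four metrics are only examples; choose the ones that match your business problem.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# An unrestricted decision tree tends to memorize the training data.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)


def summarize(fitted_model, features, labels):
    """Return several classification metrics for one split of the data."""
    predictions = fitted_model.predict(features)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "precision": precision_score(labels, predictions),
        "recall": recall_score(labels, predictions),
        "f1": f1_score(labels, predictions),
    }


train_scores = summarize(model, X_train, y_train)
test_scores = summarize(model, X_test, y_test)

for metric in train_scores:
    print(f"{metric:>9}: train={train_scores[metric]:.3f} test={test_scores[metric]:.3f}")
```

A large gap between the training and testing columns suggests overfitting, which is a natural motivation for the next modeling iteration.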

## Deliverables

You need to submit three things:

1. A **non-technical presentation**
2. A **Jupyter Notebook**
3. A **GitHub repository**

### Non-Technical Presentation

This is a slide deck for business stakeholders. Present it live and submit a PDF on Canvas. Follow this structure:

1. **Beginning**
   - Overview
   - Business and Data Understanding
2. **Middle**
   - Modeling
   - Evaluation
3. **End**
   - Recommendations
   - Next Steps
   - Thank you

Explain classification modeling in simple terms. Assume the audience knows little about machine learning. Translate metrics and features into plain language.

### Jupyter Notebook

This notebook uses Python and Markdown to present your analysis to a data science audience. Submit it in PDF format on Canvas and `.ipynb` format on GitHub.

Include these sections:

- Business Understanding
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Code Quality

### GitHub Repository

This is where you store all your project files and their version history. Follow the same requirements as in Phases 1 and 2, but include these sections in the `README.md`:

- Overview
- Business and Data Understanding (explain your stakeholder and dataset choice)
- Modeling
- Evaluation
- Conclusion

The `README.md` should bridge your non-technical presentation and the Jupyter Notebook. It should explain your methodology and analysis in more detail than the slides but without the code.

## Grading

To pass, you must meet each project rubric objective for Phase 3:

1. ML Communication
2. Data Preparation for Machine Learning
3. Nonparametric and Ensemble Modeling

### ML Communication

In Phase 3, we focus on ML Communication, which means explaining the **performance** and **insights** from machine learning models to different audiences through writing, presentations, and visuals.

High-quality ML Communication includes:

* **Rationale:** Why use machine learning instead of simpler analysis?
  * Why is this problem or data suitable for machine learning?
  * For data scientists, explain why you made changes between models.
* **Results:** Describe the classification metrics.
  * Report multiple metrics but explain why you chose them.
  * For business audiences, connect metrics to real-world implications without technical details.
  * For data scientists, explain why you chose specific metrics.
* **Limitations:** Identify limitations or uncertainties in your analysis.
  * Are there records where the model performs worse? What problems might this cause in production?
  * Be more detailed for data scientists and simpler for business audiences.
* **Recommendations:** Interpret results and limitations in the business context.
  * What should stakeholders do with this information?

#### Exceeds Objective

Communicates rationale, results, limitations, and specific recommendations from a classification model.

#### Meets Objective (Passing Bar)

Successfully communicates model metrics without major errors.

#### Approaching Objective

Communicates model metrics with at least one major error (e.g., using a regression metric for a classification model or reporting training data performance instead of test data).

#### Does Not Meet Objective

Does not communicate model metrics. Simply displaying a `classification_report` or confusion matrix is not enough. Focus on specific metrics important for your business case.
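To illustrate the difference between displaying a confusion matrix and communicating a metric, the sketch below pulls out recall and precision and phrases them for a business audience. The labels are hand-made, and the customer churn framing is a hypothetical business case chosen only for this example.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration: 1 = customer churned, 0 = customer stayed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# For binary labels, ravel() yields the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)      # share of actual churners the model caught
precision = tp / (tp + fp)   # share of flagged customers who really churned

recall_message = (
    f"Of {tp + fn} customers who actually churned, "
    f"the model flagged {tp} ({recall:.0%})."
)
precision_message = (
    f"Of {tp + fp} customers the model flagged, "
    f"{tp} actually churned ({precision:.0%})."
)
print(recall_message)
print(precision_message)
```

Which of the two sentences matters more depends on the relative cost of a missed churner versus an unnecessary retention offer, and that is the kind of reasoning the rubric asks you to spell out.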

### Data Preparation for Machine Learning

This means getting your data ready for predictive modeling by:

1. Handling missing and non-numeric data.
2. Preventing data leakage.
3. Scaling data if needed.
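The first step, handling missing and non-numeric data, might be sketched as follows. The toy columns `age` and `plan` are invented for this example, and median imputation plus one-hot encoding is just one reasonable combination among several.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "age" has a missing value, "plan" is non-numeric.
train = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "plan": ["basic", "premium", "basic", "premium"],
})
test = pd.DataFrame({
    "age": [np.nan, 52.0],
    "plan": ["premium", "gold"],  # "gold" never appears in the training data
})

preprocessor = ColumnTransformer(
    transformers=[
        # Fill missing numeric values with the training median.
        ("num", SimpleImputer(strategy="median"), ["age"]),
        # One-hot encode categories; unseen categories become all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

# Learn medians and categories from the training data only.
train_prepared = preprocessor.fit_transform(train)
test_prepared = preprocessor.transform(test)

print(train_prepared)
print(test_prepared)
```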

**Preventing Data Leakage:** Make sure your model's performance on test data is realistic. Fit transformers on training data only, then use them to transform both training and test data.
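The fit-on-train, transform-both pattern can be sketched with `StandardScaler` as below. The random data is a placeholder; the same pattern applies to imputers and encoders.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random placeholder data (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(200, 3))
y = rng.integers(0, 2, size=200)

# Split first, so the test set plays no part in any fitting step.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from train only
X_test_scaled = scaler.transform(X_test)        # reuses the training statistics

# Leaky alternative to avoid: StandardScaler().fit_transform(X) before splitting,
# which lets test-set statistics influence the training features.
print("Means learned from training data:", scaler.mean_)
```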

**Scaling:** If you use a distance-based model (like kNN) or a model with regularization (like scikit-learn's logistic regression, which is regularized by default), scale your data before fitting the model.
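To see why scaling matters for kNN, the sketch below compares the same classifier with and without a `StandardScaler` step. The data is synthetic, and one uninformative feature is deliberately inflated by a factor of 1000 to mimic a column measured in much larger units. Putting the scaler inside a pipeline means it is refit on the training folds during cross-validation, which also avoids leakage.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# With shuffle=False the two informative features come first,
# followed by three pure-noise features.
X, y = make_classification(
    n_samples=500,
    n_features=5,
    n_informative=2,
    n_redundant=0,
    shuffle=False,
    random_state=42,
)
X[:, 4] *= 1000  # put one noise feature on a much larger scale

unscaled_knn = KNeighborsClassifier()
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())

unscaled_acc = cross_val_score(unscaled_knn, X, y, cv=5).mean()
scaled_acc = cross_val_score(scaled_knn, X, y, cv=5).mean()

print(f"kNN without scaling: {unscaled_acc:.3f}")
print(f"kNN with scaling:    {scaled_acc:.3f}")
```

Without scaling, distances are dominated by the inflated noise column, so the unscaled model would be expected to score lower on data like this.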

Feature engineering (creating new features) is encouraged but not required.

#### Exceeds Objective

Goes beyond basic preparation, like using feature engineering or pipelines.

#### Meets Objective (Passing Bar)

Prepares data correctly, fitting transformers on training data only and scaling when needed.

#### Approaching Objective

Prepares some data but makes at least one major error (e.g., fitting transformers on test data, not splitting data, not scaling for distance-based models).

#### Does Not Meet Objective

Fails to prepare data for modeling, making the model unable to run.

### Nonparametric and Ensemble Modeling

Your project should explore different types of models to see which ones work best for your data and business problem.

- Your final model can be a linear model (like logistic regression), but you should also try at least one nonparametric model (like a decision tree) and explain why one is better.
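A comparison across model types could be sketched like this. The three candidates, the `max_depth` value, and the F1 scoring choice are assumptions made for illustration, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),               # linear
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=42),  # nonparametric
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),  # ensemble
}

# Compare candidates with cross-validation on the training data,
# keeping the test set untouched for the final evaluation.
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=5, scoring="f1").mean()
    for name, model in candidates.items()
}

for name, score in sorted(cv_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:>20}: mean CV F1 = {score:.3f}")
```

The highest score alone does not settle the choice. Interpretability and the cost of different error types may justify picking a simpler model, and that reasoning belongs in your write-up.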

#### Exceeds Objective

Goes beyond basic requirements, like explaining why a model is best or using advanced scikit-learn models.

#### Meets Objective (Passing Bar)

Uses at least two types of scikit-learn models and tunes at least one hyperparameter correctly.

#### Approaching Objective

Builds multiple models but makes at least one major error (e.g., including the target as a feature, using a numeric target for classification, not starting with a baseline model, or not tuning hyperparameters correctly).

#### Does Not Meet Objective

Does not build multiple classification models.

## Getting Started

1. Review the project description.
2. Ask your instructor if you have questions.
3. Complete the Project Proposal.
4. Create a new GitHub repository (don't fork the template).

For more info, check [Phase 3 Project - Choosing a Dataset](https://github.com/learn-co-curriculum/dsc-phase-3-choosing-a-dataset).

## Summary

This project helps you expand your data science skills by working with new datasets. Choosing a good dataset will help you avoid problems later. You've got this!