# Project 2

It is October 2018. The squirrels in Central Park are running into a problem and we need your help.

For this project you must go through most steps in the checklist. You must write a response for every item, although sometimes the response will simply be "does not apply". Some of the parts are a bit more nebulous; for those, simply show that you have done the work in general (the order doesn't really matter). Keep your progress and thoughts organized in this document and use formatting as appropriate (using markdown to add headers and sub-headers for each major part). Do not do the final part (launching the product). Your presentation will be written in this document in a dedicated section, with no slides or anything like that. It should, however, include the best summary plots, graphics, and data points.

You have intentionally been given very little information so far. You must communicate with your client (me) for additional information as necessary. Make sure that your communications are efficient, thought out, and not redundant, as your client might get frustrated and "fire" you. This only applies to getting information from your client; it does not necessarily apply to asking for help with the actual project itself, where you should continuously ask questions.

You must submit all data files and a pickled preprocessor and final model along with this notebook.

## Define the objective in business terms
<mark>The primary business objective is to efficiently manage resources in Central Park to mitigate the spread of a disease among the squirrel population. The goal is to identify and test squirrels that are likely infected by analyzing sighting data and behavioral patterns to preserve the ecological balance without exhausting available resources.</mark>

## How will your solution be used?
<mark>The machine learning solution will be used to predict the likelihood of a squirrel being infected based on sightings and behavior patterns. This predictive model will help park authorities decide when to deploy testing resources to specific sightings, ensuring that interventions are timely and effective, minimizing unnecessary testing and focusing efforts on high-risk cases.</mark>

## How should you frame this problem?
<mark>This problem should be framed as a binary classification problem in which the outcome to predict is whether a squirrel needs to be tested, based on its likelihood of infection. Input features might include location data, time of sighting, behaviors exhibited, and any other environmental or direct observations noted during the sighting.</mark>

## How should performance be measured? Is the performance measure aligned with the business objective?
<mark>Performance should be measured using metrics such as accuracy, precision, recall, and the area under the ROC curve (AUC-ROC). Precision is particularly important as it reflects the proportion of true positives among all positive predictions, aligning with the business objective of maximizing resource efficiency and minimizing unnecessary tests.</mark>
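
To make these metrics concrete, here is a minimal sketch in plain Python. The labels and scores are made up for illustration (1 = infected, 0 = healthy), the function names are hypothetical, and in the actual notebook a library such as scikit-learn would likely be used instead.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 = infected."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_accuracy(y_true, y_pred):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(y_true)
    return precision, recall, accuracy


def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example data, not taken from the squirrel census.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # flag for testing at 0.5

precision, recall, accuracy = precision_recall_accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"precision={precision:.2f} recall={recall:.2f} "
      f"accuracy={accuracy:.2f} auc={auc:.4f}")
```

Precision and recall depend on the cut-off used to turn scores into test/no-test decisions, while AUC summarizes ranking quality across all cut-offs.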

## What would be the minimum performance needed to reach the business objective?
<mark>The minimum performance would likely require a high precision to ensure that only squirrels highly likely to be infected are tested. A precision threshold of at least 80% might be considered essential to avoid wasting limited testing resources and personnel time.</mark>
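
One way such a precision floor could be applied is to choose the decision threshold on the model's scores. The sketch below is an illustration under assumptions: the scores and labels are made up, the function name is hypothetical, and 0.80 is just the candidate target mentioned above. It returns the lowest threshold that still meets the target, since a lower threshold flags more squirrels and therefore gives higher recall.

```python
def lowest_threshold_meeting_precision(y_true, scores, target_precision=0.80):
    """Return (threshold, precision, recall) for the lowest score threshold
    whose precision meets the target, or None if no threshold does."""
    best = None
    total_pos = sum(y_true)
    # Walk thresholds from strictest to most lenient; keep the last one that
    # qualifies, because precision is not guaranteed to fall monotonically.
    for threshold in sorted(set(scores), reverse=True):
        flagged = [t for t, s in zip(y_true, scores) if s >= threshold]
        tp = sum(flagged)
        precision = tp / len(flagged)
        recall = tp / total_pos if total_pos else 0.0
        if precision >= target_precision:
            best = (threshold, precision, recall)
    return best


# Made-up example data: 1 = infected, 0 = healthy.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.6]

result = lowest_threshold_meeting_precision(y_true, scores)
print(result)
```

In practice the threshold would be chosen on validation data rather than on the test set, so that the reported precision is not optimistic.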

## What are comparable problems? Can you reuse (personal or readily available) experience or tools?
<mark>Comparable problems include medical diagnostic testing, wildlife monitoring for disease control, and predictive maintenance. Tools and experience from these areas, such as decision trees, logistic regression, or more complex machine learning models like Random Forests or SVMs, could be adapted for use in this project.</mark>

## Is human expertise available?
<mark>Yes, park rangers, wildlife experts, and veterinarians would be available to provide insights into squirrel behaviors, symptoms of the disease, and other ecological factors that might influence the spread of the disease.</mark>

## How would you solve the problem manually?
<mark>Manually, the problem would be solved by having experts review each sighting report, assessing the likelihood of infection based on the squirrel's behavior, physical symptoms observed, and the context of the sighting (e.g., location, time, other environmental factors).</mark>

## List the assumptions you (or others) have made so far. Verify assumptions if possible.
<mark>Assumptions include:</mark>
- <mark>The data collected accurately represents the overall squirrel population and their health status.</mark>
- <mark>Behavioral patterns can reliably indicate health status.</mark>
- <mark>The disease has visible or otherwise detectable symptoms that can be observed in a park setting.</mark>
- <mark>Testing resources and personnel are limited and must be efficiently allocated.</mark>



Get the Data
============
1. List the data you need and how much you need
- <mark>Squirrel Sightings Data</mark>
- <mark>Squirrel Health Status Data</mark>
- <mark>Behavioral Observations Data</mark>
- <mark>Environmental and Weather Data</mark>





2.  Find and document where you can get that data

- diseased_squirrels.csv - provides a unique identifier for each squirrel, the time of day, and the date


- 2018 Central Park Squirrel Census - Squirrel Data

    https://data.cityofnewyork.us/Environment/2018-Central-Park-Squirrel-Census-Squirrel-Data/vfnx-vebw/data_preview


- 2018 Squirrel Census Fur Color Map
    https://data.cityofnewyork.us/Environment/2018-Squirrel-Census-Fur-Color-Map/fak5-wcft


- 2018 Central Park Squirrel Census - Stories (maybe)
    https://data.cityofnewyork.us/Environment/2018-Central-Park-Squirrel-Census-Stories/gfqj-f768/about_data




3. Get access authorizations


4. Create a workspace: 
- This notebook 

5.  **Get the data**: 
   

6. **Convert the data to a format you can easily manipulate**:
   - The data is provided as CSV files, which are easy to load and manipulate.

7. **Ensure sensitive information is deleted or protected**:
- The public data has already been anonymized.

8. **Check the size and type of data (time series, geographical, …)**

    1. 
        <mark>TODO</mark>
    
    2. Is it a time series?
        <mark>TODO</mark>
    
    3. Are any of the features unusable for the business problem?
        <mark>TODO</mark>

    4. Which feature(s) will be used as the target/label for the business problem? (including which are required to derive the correct label)
        <mark>TODO</mark>

    5. Should any of the features be stratified during the train/test split to avoid sampling biases?
       <mark>TODO</mark>
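
The last two questions can be illustrated with a small sketch. It assumes, without confirmation from the client, that the label is derived by matching squirrel identifiers between the census sightings and `diseased_squirrels.csv`, and that the split is stratified on that label. The inline CSV text, the `squirrel_id` and `shift` column names, and the helper functions are all hypothetical stand-ins. The standard library is used to keep the sketch dependency-free; in the notebook itself pandas would be the more usual choice.

```python
import csv
import io
import random
from collections import defaultdict

# Tiny inline stand-ins for the real files; the column names are hypothetical.
SIGHTINGS_CSV = """squirrel_id,shift
S01,AM
S02,PM
S03,AM
S04,PM
S05,AM
S06,PM
S07,AM
S08,PM
S09,AM
S10,PM
"""
DISEASED_CSV = """squirrel_id
S02
S05
S07
S10
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def add_labels(sightings, diseased_ids, id_column="squirrel_id"):
    """Attach infected = 1 when the sighting's ID appears in the diseased set."""
    return [dict(row, infected=int(row[id_column] in diseased_ids))
            for row in sightings]


def stratified_split(rows, key, test_fraction=0.2, seed=42):
    """Split rows so each value of `key` keeps a similar share in both sets."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    train, test = [], []
    for value in sorted(groups):
        members = list(groups[value])
        rng.shuffle(members)  # shuffle within each class before cutting
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


diseased_ids = {row["squirrel_id"] for row in load_rows(DISEASED_CSV)}
labeled = add_labels(load_rows(SIGHTINGS_CSV), diseased_ids)
train_rows, test_rows = stratified_split(labeled, key="infected")
print(len(train_rows), len(test_rows))
```

Stratifying on the label matters most when infected squirrels are rare, because a purely random split could leave the test set with very few positive cases.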

Explore the Data
================


1. Copy the data for exploration, downsampling to a manageable size if necessary.

2. Study each attribute and its characteristics: Name; Type (categorical, numerical, bounded, text, structured, ...); % of missing values; Noisiness and type of noise (stochastic, outliers, rounding errors, ...); Usefulness for the task; Type of distribution (Gaussian, uniform, logarithmic, ...)

3. For supervised learning tasks, identify the target attribute(s)

4. Visualize the data

5. Study the correlations between attributes

6. Study how you would solve the problem manually

7. Identify the promising transformations you may want to apply

8. Identify extra data that would be useful (go back to “Get the Data”)
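
As a starting point for step 2, the percentage of missing values per attribute can be computed with a few lines of code. This is a minimal sketch on made-up rows: the column names, the sample values, and the set of tokens treated as missing are assumptions, and the real census columns should be substituted once the data is loaded.

```python
import csv
import io

# Made-up sample; the real column names may differ.
SAMPLE_CSV = """squirrel_id,age,primary_fur_color
S01,Adult,Gray
S02,,Gray
S03,Juvenile,
S04,Adult,Cinnamon
"""


def missing_percentages(rows, missing_tokens=("", "NA", "NaN")):
    """Return {column: percent of rows whose value counts as missing}."""
    if not rows:
        return {}
    total = len(rows)
    return {
        column: 100.0 * sum(
            1 for row in rows if (row[column] or "").strip() in missing_tokens
        ) / total
        for column in rows[0].keys()
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = missing_percentages(rows)
for column, percent in report.items():
    print(f"{column}: {percent:.1f}% missing")
```

Columns with a high share of missing values are candidates for imputation or removal during data preparation.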