# Evaluation Notebook

## Evaluate Results

Previous evaluation steps dealt with factors such as the accuracy and generality of the model. This step assesses the degree to which the model meets the business objectives and seeks to determine if there is some business reason why this model is deficient. 

Moreover, evaluation also assesses other data mining results generated. Data mining results involve models that are necessarily related to the original business objectives and all other findings that are not necessarily related to the original business objectives, but might also unveil additional challenges, information, or hints for future directions.

### Outputs

<b>1. Assessment of data mining results with respect to business success criteria</b>
- Summarize assessment results in terms of business success criteria, including a final statement regarding whether the project already meets the initial business objectives.

   1. In the first stage of this project, I identified the success criteria as follows:
        - The most important success criterion, shared across all of the project’s objectives, is learning valuable skills from this training. I believe this project will be a success if I come away from it with more confidence in my abilities, an increased technical skill set, and/or a better understanding of ASR’s work environment. The specific success criteria for each project stage are as follows:
            - Business Understanding: Success will be gaining a deeper understanding of the project and its purpose and thoroughly answering all questions asked.
            - Data Understanding: Success will be learning more about the variables in the dataset and understanding how they can be leveraged in future analysis. I should have a clear plan for the data preparation stage once this stage is complete.
            - Data Preparation: Success will be identifying outliers and missing observations and delivering a dataset that is thoroughly prepared for analysis. If I am successful in this stage, I will not run into any data-related issues when modeling.
            - Modeling: Success will be creating a model that accurately predicts whether a stock should be sold or bought.
            - Evaluation: Success will be determined using the AUC of the ROC curve.</br> </br>
   2. After completing the project, I believe that it meets the initial business objectives. Having completed stage 4, I have gained substantial new skills and increased familiarity with machine learning, GitLab, and other ASR work-related programs. In addition, the final model created in stage 4 improved on the baseline in accuracy, F1, and AUC despite the challenges inherent in the provided data set. The final results for the project were as follows:
       - **Baseline:** ACC = 0.624, F1 = 0.500, AUC = 0.616
       - **Final:** ACC = 0.634, F1 = 0.562, AUC = 0.631
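
For readers unfamiliar with the three metrics reported above, the following is a minimal base-R sketch of how accuracy, F1, and AUC can be computed from a set of predictions. The helper name `classification_metrics` and the toy vectors are hypothetical and are not taken from the project; the numbers it produces are unrelated to the results listed above.

```r
# Hypothetical helper: classification metrics from true labels (0/1),
# predicted labels (0/1), and predicted scores for the positive class.
classification_metrics <- function(actual, predicted, score) {
  tp <- sum(predicted == 1 & actual == 1)  # true positives
  fp <- sum(predicted == 1 & actual == 0)  # false positives
  fn <- sum(predicted == 0 & actual == 1)  # false negatives
  tn <- sum(predicted == 0 & actual == 0)  # true negatives

  accuracy  <- (tp + tn) / length(actual)
  precision <- if (tp + fp > 0) tp / (tp + fp) else NA_real_
  recall    <- if (tp + fn > 0) tp / (tp + fn) else NA_real_
  f1 <- 2 * precision * recall / (precision + recall)

  # AUC via the rank-sum (Mann-Whitney) identity; tied scores get average ranks.
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  r <- rank(score)
  auc <- (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

  c(ACC = accuracy, F1 = f1, AUC = auc)
}

# Toy data, made up purely for illustration (not the project's data).
actual    <- c(1, 0, 1, 1, 0, 0)
score     <- c(0.9, 0.4, 0.3, 0.8, 0.6, 0.2)
predicted <- as.integer(score >= 0.5)  # assumed 0.5 decision threshold
metrics <- classification_metrics(actual, predicted, score)
print(round(metrics, 3))
```

Because AUC is computed from the ranking of the scores rather than from thresholded labels, it does not depend on the 0.5 cutoff assumed here, which is one reason it can serve as the headline criterion for the evaluation stage.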

## Review Process

At this point, the resulting models appear to be satisfactory and to satisfy business needs. It is now appropriate to do a more thorough review of the data mining engagement in order to determine if there is any important factor or task that has somehow been overlooked. This review also covers quality assurance issues—for example: Did we correctly build the model? Did we use only the attributes that we are allowed to use and that are available for future analyses?

### Outputs

<b>3. Review of Process</b>
- Summarize the process review and highlight activities that have been missed and those that should be repeated. </br>
    - **Factor Review:** All the steps of the CRISP-DM model were implemented according to the objectives set out in the beginning of this training. </br>
     - **Data Exploration:**
       - Variable Names: 
         - The "year" variable was created and all variables between the data sets were renamed to match. All data sets were then combined into one data set. 
         - The only permanent changes made to the data set were the removal of four extreme outliers detected through Cook's distance and of two variables that were almost entirely NA (operatingCycle and cashConversionCycle).
     - **Data Preparation:**
       - Duplicates:
         - Columns with duplicate names were removed. Then column sums were used to find similar or identical variables that had slightly different names but were still duplicates. Finally, these variables were merged to create a single variable.
       - Variable Names: 
         - All variables were converted to lower case and a variety of steps were taken to make them uniform. 
       - Categorical Encoding:
         - "sector" was encoded as a numeric variable, "sector_num", to ease the machine learning process.
       - Missing Data:
         - Removed rows with more than 50 missing values.
         - Removed columns with more than 15% missing data.
       - Imputation: 
         - Missing values were imputed using the median.
     - **Data Modeling:**
       - Base Model: 
         - After trying various models, the random forest model was chosen and the base model was completed. 
         - The data were split into training and test sets at a 70/30 ratio, with "year" removed so that the model would not rely so heavily on it for classification.
       - Model Improvement:
         - The data set was scaled using the standard deviation, and missing values were imputed both with the median and with a KNN model. KNN imputation was chosen because it was similar to the median in classification power but preserved more variability in the data set. It was assumed that this would mimic the real data set more closely.
       - Feature Selection: 
         - Decision trees were run for each year to see the importance of different variables. All variables that had an importance greater than 0 were chosen and combined in a reduced data set. This dataset led to worse results than the full dataset when used on the random forest model so the decision was to keep the full dataset moving forward.
       - Hyperparameter Tuning:
         - The mtry value was tuned to find the optimal value.
       - Additional Tuning: 
         - The ntree value was also expanded to include values larger than 500. The results showed that 500 was still the best value to choose.
         - Controls were selected and repeated cross-validation was implemented in the final model. This included the value of 500 for ntree, but results were sub-optimal when the previously tuned mtry value was used. It was decided to let the cross-validation procedure choose its own optimal mtry, and it chose 88.
    - **Model Review:** Using only the included data set, I believe that the model was built correctly using only the attributes that were allowed. The code for the final model is listed below:

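In [None]:
# Code from the final model in stage 4
library(caret)  # provides trainControl() and train()
ntree <- 500
# Repeated cross-validation (2 folds, 2 repeats) with a random search over mtry
control <- trainControl(method = "repeatedcv", number = 2, repeats = 2, search = "random")
set.seed(123)
# Pass the control object so the cross-validation settings are actually applied
fit.rf <- train(class ~ ., data = train, method = "rf", ntree = ntree, trControl = control)
fit.rf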
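
The missing-data rules and the 70/30 split summarized in the process review can be sketched in base R as follows. This is an illustration under stated assumptions, not the code used in the project: the helper name `prepare_data`, its argument names, the order in which rows and columns are filtered, and the toy data frame are all hypothetical.

```r
# Hypothetical helper applying the missing-data rules from the review:
# drop rows with more than 50 missing values, drop columns with more than
# 15% missing values, then fill remaining gaps in numeric columns with the
# column median. The row-then-column order is an assumption.
prepare_data <- function(df, max_row_na = 50, max_col_na_frac = 0.15) {
  df <- df[rowSums(is.na(df)) <= max_row_na, , drop = FALSE]
  df <- df[, colMeans(is.na(df)) <= max_col_na_frac, drop = FALSE]
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      missing <- is.na(df[[col]])
      df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
    }
  }
  df
}

# Toy data, made up purely for illustration.
toy <- data.frame(
  a     = c(1, NA, 3, 4, 5, 6, 7, 8, 9, 10),    # 10% missing: kept and imputed
  b     = c(NA, NA, NA, 4, 5, 6, 7, 8, 9, 10),  # 30% missing: dropped
  class = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0)
)
clean <- prepare_data(toy)

# 70/30 train/test split (the seed here is arbitrary).
set.seed(123)
train_idx <- sample(nrow(clean), size = floor(0.7 * nrow(clean)))
train_set <- clean[train_idx, ]
test_set  <- clean[-train_idx, ]
```

One design point worth noting: computing the medians on the full data set before splitting, as this sketch does, lets information from the test rows influence the training data. Computing the medians on the training set only and reusing them for the test set would avoid that.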

## Determine Next Steps

Depending on the results of the assessment and the process review, the project team decides how to proceed. The team decides whether to finish this project and move on to deployment, initiate further iterations, or set up new data mining projects. This task includes analyses of remaining resources and budget, which may influence the decisions.

### Outputs

<b>4. List of possible actions</b>
- List the potential further actions, along with the reasons for and against each option. </br>
     - Identify additional data sets to improve the model's performance: The benefits of this are discussed in the decision section below.
     - Deploy the model as-is: This would allow for a reasonably accurate model to be presented to the client but it would not provide the best possible model to them.
     - Further tune the model's parameters: Additional tuning of the parameters could increase the model's performance, but only marginally. The expected gain does not justify the time and effort required.

<b>5. Decision</b>
- Describe the decision as to how to proceed, along with the rationale.
     - Although I used the resources available for this training, I do not think that the current model is the best choice for deployment. Because certain years have such a large impact on stock returns, I think that any model attempting to predict whether stocks should be sold should include broader macroeconomic variables. For example, take what is happening right now in the economy. In most instances, a company's financial records will not be able to predict a recession or a shock to the stock market. Accurate economic forecasts, however, could be combined with additional years of financial information for stocks so that better predictions can be made.
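
As a rough illustration of what adding macroeconomic variables could look like, the sketch below attaches year-level indicators to stock-level rows with a left join in base R. Every column name and value here (`ticker`, `gdp_growth`, `interest_rate`, and the numbers themselves) is hypothetical and chosen only to show the shape of the merge; none of it comes from the project's data.

```r
# All column names and values below are made up for illustration only.
stocks <- data.frame(
  ticker = c("AAA", "BBB", "AAA", "BBB"),
  year   = c(2017, 2017, 2018, 2018),
  class  = c(1, 0, 0, 1)
)
macro <- data.frame(
  year          = c(2017, 2018),
  gdp_growth    = c(2.0, 3.0),
  interest_rate = c(1.0, 1.5)
)

# Left join: keep every stock-year row and attach that year's indicators.
combined <- merge(stocks, macro, by = "year", all.x = TRUE)
print(combined)
```

With this layout the model would see the economic conditions behind each year rather than the year label itself, which is the motivation given in the decision above.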

## Code Base Update

### Outputs

<b>6. Update code base</b>
- Update code base with any new functions/classes which haven't been implemented yet, e.g. a new preprocessing function or a new visualization technique.
- Suggest further improvements to functions/classes within code base </br>
    - All of my code has been updated accordingly and my notebooks, scripts, and data sets have been pushed to the shared GitLab.