## Background / Motivation

What motivated you to work on this problem?

Mention any background about the problem, if it is required to understand your analysis later on.

Our group decided to study deaths related to alcohol use disorders because we noticed an unhealthy relationship with alcohol within our society. In the United States, the legal drinking age is 21; however, this law is broken by a large proportion of people. As a result, young people are not taught how to drink responsibly from an early age, and we believe this partly contributes to the unhealthy relationship. This observation is what initially led our group to look into alcohol datasets.

Once we began examining different datasets, we started to think about the best data to look at to make a prediction model. After mining through many different datasets and considering the various implications, we examined the factors influencing alcohol-related mortality. We chose this dataset because it had sufficient predictors we could use to generate a compelling model.

## Problem statement 

Describe your problem statement. Articulate your objectives using absolutely no jargon. Interpret the problem as inference and/or prediction.

Through limited research, we discovered that alcohol is responsible for 3 million deaths annually (5.3% of all deaths worldwide), according to the World Health Organization (WHO). We wanted to investigate this data more as we believe understanding the trends of alcohol-related deaths across different countries can help identify risk factors and inform policies. Through this project, we wanted to uncover which habits/indicators contribute to mass unhealthy drinking.

In the context of this problem, we created a logistic model to classify a country as either high risk or not high risk. We defined a high-risk country as one whose alcohol-attributable mortality rate is greater than 6%.

Overall, our problem was mostly an inference problem. It is in part a prediction problem because at the end of our project, we wanted to be able to predict if the country is high risk given the predictors and nothing else. However, it is more inference because our goal is not to predict a country's class but primarily to understand the relationship between the predictors and the response to understand which factors are leading to this high death rate. 

## Data sources
What data did you use? Provide details about your data. Include links to data if you are using open-access data.

As we were researching this topic, we found a very informative article that gave us a general idea of which predictors may play a large role in the high percentage of deaths (link: https://ourworldindata.org/alcohol-consumption). This article provides many interactive visuals that gave us crucial background information for creating our model. Each visual included a CSV with the underlying data, which we downloaded and cleaned to use in our model.

## Stakeholders
Who cares? If you are successful, what difference will it make to them?

The model has three main stakeholders: educators, health services, and governments. Each has a crucial role in reducing the negative effects of alcohol misuse.

Educators are one of the primary stakeholders in this project. By analyzing which factors contribute to a country being classified as high-risk, schools can effectively educate the youth on how to have healthy drinking habits. For example, they can focus on promoting responsible drinking behavior and provide information on the health consequences of alcohol misuse. Additionally, they can work on creating prevention programs that target high-risk individuals or groups. 

The second stakeholder, health services, plays a critical role in reducing the death rate related to alcohol misuse. Our analysis can help them identify high-risk individuals early, based on demographic and other information captured by our predictors. For example, doctors can use this information to identify patients who may be at risk for alcohol-related illnesses or who may require additional support to avoid harmful drinking behaviors.

Governments are the third stakeholder in our project. By using our statistical analysis, they can more effectively target their laws and policies to foster a society with healthy drinking habits. For example, they may want to implement stricter regulations on when and how much alcohol an individual can buy.

## Data quality check / cleaning / preparation 

In a tabular form, show the distribution of values of each variable used in the analysis - for both categorical and continuous variables. Distribution of a categorical variable must include the number of missing values, the number of unique values, the frequency of all its levels. If a categorical variable has too many levels, you may just include the counts of the top 3-5 levels. 

If the tables in this section take too much space, you may put them in the appendix, and just mention any useful insights you obtained from the data quality check that helped you develop the model or helped you realize the necessary data cleaning / preparation.

Were there any potentially incorrect values of variables that required cleaning? If yes, how did you clean them? 

Did you do any data wrangling or data preparation before the data was ready to use for model development? Did you create any new predictors from existing predictors? For example, if you have number of transactions and spend in a credit card dataset, you may create spend per transaction for predicting if a customer pays their credit card bill. Mention the steps at a broad level; you may put minor details in the appendix. Only mention the steps that ended up being useful towards developing your final model(s).

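```python
import pandas as pd
import numpy as np

# Load the merged country-level dataset and drop the first three columns
df = pd.read_csv('./Datasets/merged.csv')
df = df.iloc[:, 3:]

# Summary statistics for each numeric variable (one row per variable)
df.describe().T
```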

| Variable | count | mean | std | min | 25% | 50% | 75% | max |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|
| wine_as_share_alcohol_consumption | 172.0 | 15.97616 | 17.4645 | 0.0 | 2.1 | 8.4 | 27.925 | 80.2 |
| beer_as_share_alcohol_consumption | 172.0 | 44.01279 | 22.15635 | 2.3 | 29.275 | 40.4 | 57.45 | 100.0 |
| spirits_as_share_alcohol_consumption | 172.0 | 28.78895 | 23.6574 | 0.0 | 10.5 | 23.85 | 42.15 | 97.3 |
| tot_alcohol_consumption_per_capita_liters | 176.0 | 6.147551 | 4.199294 | 0.003 | 2.19 | 5.91 | 9.58 | 18.35 |
| gdp_per_capita | 176.0 | 19310.67 | 20029.81 | 825.205688 | 4382.449 | 11934.86 | 27912.89 | 113182.7 |
| Population | 176.0 | 41198210.0 | 149197500.0 | 89958.0 | 2299560.0 | 8799540.0 | 29331180.0 | 1393715000.0 |
| percent_who_have_not_drank_alcohol_in_last_year | 175.0 | 60.60286 | 23.3348 | 8.2 | 40.8 | 63.6 | 78.1 | 99.9 |
| perc_life_no_drink | 181.0 | 44.18398 | 28.28417 | 3.4 | 17.9 | 41.6 | 66.4 | 99.5 |
| total_alcohol_consumption_per_capita_liters | 182.0 | 6.044143 | 4.194188 | 0.003 | 2.1525 | 5.825 | 9.55 | 18.35 |
| percent_who_have_drank_alcohol_in_last_year | 174.0 | 39.4454 | 23.39338 | 0.1 | 21.9 | 36.4 | 59.45 | 91.8 |


Our group collected all of our data from WHO. We cleaned each individual dataset and then merged them all together. Most of the predictors were stored in their own tables, so we had many tables to merge on Country.

We ultimately ended up with 183 observations (representing 183 different countries, i.e., population data) and 37 total predictors. We then added a new binary column to the merged dataset that flags a high death rate, defined as a mortality rate above 6%.
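
The merge and labeling steps described above can be sketched roughly as follows. This is a minimal illustration and not our actual cleaning code: the three small tables, the country names `A`, `B`, `C`, the column name `alcohol_attributable_mortality_pct`, and the `high_risk` column name are all hypothetical stand-ins for the downloaded CSVs.

```python
from functools import reduce

import pandas as pd

# Hypothetical per-indicator tables standing in for the downloaded CSVs.
consumption = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "total_alcohol_consumption_per_capita_liters": [2.1, 9.6, 12.4],
})
gdp = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "gdp_per_capita": [4382.0, 11935.0, 27913.0],
})
mortality = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "alcohol_attributable_mortality_pct": [1.8, 6.0, 7.5],  # hypothetical name
})

# Outer-join on Country so a country missing from one table is kept (as NaN).
tables = [consumption, gdp, mortality]
merged = reduce(
    lambda left, right: left.merge(right, on="Country", how="outer"), tables
)

# Binary response: 1 if the mortality rate is strictly above 6%, otherwise 0.
HIGH_RISK_THRESHOLD = 6.0
merged["high_risk"] = (
    merged["alcohol_attributable_mortality_pct"] > HIGH_RISK_THRESHOLD
).astype(int)

print(merged[["Country", "high_risk"]])
```

One caveat with this sketch: a missing mortality value compares as not greater than the threshold, so such a country would be labeled 0 unless missing values are handled before this step.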

## Exploratory data analysis

Put the relevant EDA here (visualizations, tables, etc.) that helped you figure out useful predictors for developing the model(s). Only put the EDA that ended up being useful towards developing your final model(s). 

List the insights (as bullet points) you got from EDA that ended up being useful towards developing your final model. 

Again, if there are too many plots / tables, you may put them into appendix, and just mention the insights you got from them.

## Approach

What kind of a model (linear / logistic / other) did you use? What performance metric(s) did you optimize and why?

Is there anything unorthodox / new in your approach? 

What problems did you anticipate? What problems did you encounter? Did the very first model you tried work? 

Did your problem already have solution(s) (posted on Kaggle or elsewhere). If yes, then how did you build upon those solutions, what did you do differently? Is your model better as compared to those solutions in terms of prediction / inference?

**Important: Mention any code repositories (with citations) or other sources that you used, and specifically what changes you made to them for your project.**

We used a logistic model since our response variable was binary (high mortality rate/not high mortality rate), and we initially sought to optimize classification accuracy. As we developed the model, however, we decided that wrongly classifying a country as high risk (and having it address an issue it may not have) is less costly than missing a truly high-risk country and leaving it vulnerable. As a result, we also worked to minimize the false negative rate.
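
To illustrate how the classification threshold affects the false negative rate, here is a small self-contained sketch. The labels and predicted probabilities are made up for illustration, and the helper names `confusion_counts` and `false_negative_rate` are our own; none of this reproduces our fitted model.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Return (tp, fp, tn, fn), predicting class 1 when prob >= threshold."""
    tp = fp = tn = fn = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def false_negative_rate(y_true, y_prob, threshold):
    """Share of truly high-risk cases that the classifier misses."""
    tp, _, _, fn = confusion_counts(y_true, y_prob, threshold)
    return fn / (tp + fn) if (tp + fn) else 0.0


# Made-up labels (1 = high risk) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.45, 0.32, 0.4, 0.35, 0.1, 0.6, 0.7]

for threshold in (0.5, 0.3):
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    fnr = false_negative_rate(y_true, y_prob, threshold)
    print(f"threshold={threshold}: TP={tp} FP={fp} TN={tn} FN={fn} FNR={fnr:.2f}")
```

Lowering the threshold flags more countries as high risk, which reduces missed high-risk countries at the price of more false alarms.
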
Since our dataset had very few observations and a lot of predictors, we anticipated many problems. With so few observations, it was difficult to decide how to split the data into training and test sets so that we could still get the best model possible. With so many predictors, there were too many combinations to choose from. This made it hard to try out different models manually (based on our EDA) because we couldn't possibly explore all of our options. It was also difficult to use subset selection methods because that many predictors is very computationally taxing: some methods took a very long time to execute and others were simply not feasible.
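
A quick count shows why exhaustive subset selection is infeasible at this scale. The sketch below assumes the 37 predictors reported for the merged dataset; the forward stepwise figure is included only as a point of comparison and is not a claim about the method we used.

```python
from math import comb

N_PREDICTORS = 37  # predictor count reported for the merged dataset

# Each predictor is either in or out of a model, so there are 2**p subsets
# (including the empty, intercept-only model).
total_subsets = 2 ** N_PREDICTORS
print(f"All subsets: {total_subsets:,}")

# Even restricting to models with 1 to 5 predictors leaves many candidates.
up_to_five = sum(comb(N_PREDICTORS, k) for k in range(1, 6))
print(f"Subsets with 1 to 5 predictors: {up_to_five:,}")

# Forward stepwise selection fits at most p + (p - 1) + ... + 1 models.
forward_stepwise_fits = N_PREDICTORS * (N_PREDICTORS + 1) // 2
print(f"Forward stepwise fits (upper bound): {forward_stepwise_fits:,}")
```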

## Developing the model

Explain the steps taken to develop and improve the base model - informative visualizations / addressing modeling assumption violations / variable transformation / interactions / outlier treatment / influential points treatment / addressing over-fitting / addressing multicollinearity / variable selection - stepwise regression, lasso, ridge regression). 

Did you succeed in achieving your goal, or did you fail? Why?

**Put the final model equation**.

**Important: This section should be rigorous and thorough. Present detailed information about decision you made, why you made them, and any evidence/experimentation to back them up.**
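
As a sketch of the general form only, a logistic model with the five predictors named in our conclusions can be written as below. The coefficients $\beta_0, \dots, \beta_5$ are left symbolic because the fitted values are not reproduced here, and the model is shown without any transformations or interaction terms, which is an assumption.

$$
\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5
$$

Here $p$ is the probability that a country is high risk, and $x_1, \dots, x_5$ denote `daly_alc_use_disorders__all_ages_standardized`, `perc_heavy_drinkers_both_sexes`, `spirits_as_share_alcohol_consumption`, `total_alcohol_consumption_per_capita_liters`, and `gdp_per_capita`, respectively.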

## Limitations of the model with regard to inference / prediction

If it is inference, will the inference hold for a certain period of time, for a certain subset of population, and / or for certain conditions.

If it is prediction, then will it be possible / convenient / expensive for the stakeholders to collect the data relating to the predictors in the model. Using your model, how soon will the stakeholder be able to predict the outcome before the outcome occurs. For example, if the model predicts the number of bikes people will rent in Evanston on a certain day, then how many days before that day will your model be able to make the prediction. This will depend on how soon the data that your model uses becomes available. If you are predicting election results, how many days / weeks / months / years before the election can you predict the results. 

When will your model become too obsolete to be useful?

Our data is from 2016, which is not ideal for inferences about the current world. However, there is little reason to believe that the significance of our predictors and model will become obsolete any time soon, as the values of our predictors likely do not vary much across years. Additionally, we have data on a great majority of countries, so our model should be relevant to stakeholders across most countries. However, the data is mostly from larger countries, so the model may not be as tailored to the conditions of smaller countries, and our recommendations may not be as accurate for them.

## Other sections *(optional)*

You are welcome to introduce additional sections or subsections, if required, to address any specific aspects of your project in detail. For example, you may briefly discuss potential future work that the research community could focus on to make further progress in the direction of your project's topic.

## Conclusions and Recommendations to stakeholder(s)

What conclusions do you draw based on your model? If it is inference you may draw conclusions based on the coefficients, statistical significance of predictors / interactions, etc. If it is prediction, you may draw conclusions based on prediction accuracy, or other performance metrics.

How do you use those conclusions to come up with meaningful recommendations for stakeholders? The recommendations must be action-items for stakeholders that they can directly implement without any further analysis. Be as precise as possible. The stakeholder(s) are depending on you to come up with practically implementable recommendations, instead of having to think for themselves.

If your recommendations are not practically implementable by stakeholders, how will they help them? Is there some additional data / analysis / domain expertise you need to do to make the recommendations implementable? 

Do the stakeholder(s) need to be aware about some limitations of your model? Is your model only good for one-time use, or is it possible to update your model at a certain frequency (based on recent data) to keep using it in the future? If it can be used in the future, then for how far into the future?

Our final model contained only 5 predictors out of the initial 41, so we think that these predictors ended up yielding the best model because of how relevant they are to our response variable. Thus, our recommendations are primarily inferences based on the presence of these indicators in our model. Here, we outline these recommendations by predictor:

- The predictor `daly_alc_use_disorders__all_ages_standardized` can help health services and governments, as its significance in the final model shows how strongly alcohol use disorders can affect a country's mortality rate. Health services can use this information to become more adept at spotting and treating alcohol use disorders, and governments can use it to fund such efforts by health services or to provide more programs that help people with alcohol use disorders.
- The predictor `perc_heavy_drinkers_both_sexes` can help all stakeholders. Educators can use this information to present the true harms of heavy drinking to their country's youth, and health services and governments can use this information in much the same way as for the previous predictor.
- The predictor `spirits_as_share_alcohol_consumption` can also help all stakeholders. Educators can inform the youth about the potential dangers of frequently drinking spirits, health services can focus on developing methods that better treat harm related to spirits, and governments may want to limit the production, distribution, or consumption of spirits.
- The predictor `total_alcohol_consumption_per_capita_liters` can primarily help governments. They can monitor this predictor to gauge whether their country may be approaching a dangerous level of alcohol consumption, and then take precautions accordingly.
- The predictor `gdp_per_capita` can also primarily help governments. It would likely be used in a similar fashion to the previous predictor, as something to monitor along with the indicator variable.

Most of these recommendations should be fairly implementable and practical, to the point where they should be (at a minimum) worthy of consideration by the stakeholders. That said, the stakeholders should be aware that our data is from 2016, so a more accurate model could likely be constructed with more recent data.

## GitHub and individual contribution {-}

https://github.com/jtroxel7/STAT303-2-LCJJ

<html>
<style>
table, td, th {
  border: 1px solid black;
}

table {
  border-collapse: collapse;
  width: 100%;
}

th {
  text-align: left;
}
    

</style>
<body>

<h2>Individual contribution</h2>

<table style="width:100%">
     <colgroup>
       <col span="1" style="width: 15%;">
       <col span="1" style="width: 20%;">
       <col span="1" style="width: 50%;">
       <col span="1" style="width: 15%;"> 
    </colgroup>
  <tr>
    <th>Team member</th>
    <th>Contributed aspects</th>
    <th>Details</th>
    <th>Number of GitHub commits</th>
  </tr>
  <tr>
    <td>Lainey Neild</td>
    <td>Data cleaning and EDA</td>
    <td>Cleaned data to impute missing values and developed visualizations to identify appropriate variable transformations.</td>
    <td>100</td>
  </tr>
  <tr>
    <td>Charlie Lovett</td>
    <td>Assumptions and interactions</td>
    <td>Checked and addressed modeling assumptions and identified relevant variable interactions.</td>
    <td>120</td>
  </tr>
    <tr>
    <td>Jacob Muriel</td>
    <td>Outlier and influential points treatment</td>
    <td>Identified outliers/influential points and analyzed their effect on the model.</td>
    <td>130</td>    
  </tr>
    <tr>
    <td>Jack Troxel</td>
    <td>Variable selection and addressing overfitting</td>
    <td>Performed variable selection on an exhaustive set of predictors to address multicollinearity and overfitting.</td>
    <td>150</td>    
  </tr>
</table>
</body>
</html>

We did not face any major challenges while using GitHub. We did, however, find that it was easier for us to work on the project in person on a single computer. During the model development process, it was tricky to make progress on one computer and then pick up where we left off on another. Overall, we found that working in person was more efficient for us, so we did not use GitHub as much as other groups may have.


## Appendix {-}

You may put additional stuff here as Appendix. You may refer to the Appendix in the main report to support your arguments. However, the appendix section is unlikely to be checked while grading, unless the grader deems it necessary.