## Length of the report {-}
The length of the report must be no more than 15 pages, when printed as PDF. However, there is no requirement on the minimum number of pages.

You may put additional material in the Appendix. You may refer to the Appendix in the main report to support your arguments. However, your appendix is unlikely to be checked while grading, unless the grader deems it necessary. The appendix, references, and information about GitHub and individual contribution will not be included in the page count, and there is no limit on the length of the appendix.

**Delete this section from the report, when using this template.** 

## Code should be put separately in the code template {-}
Your report should be written in a research-paper-like style. If something can only be explained by showing the code, then you may include it; otherwise, do not put code in the report. We will check your code in the code template. 

**Delete this section from the report, when using this template.** 

## Background / Motivation

What motivated you to work on this problem?

Mention any background about the problem, if it is required to understand your analysis later on.

## Problem statement 

Describe your problem statement. Articulate your objectives using absolutely no jargon. Interpret the problem as inference and/or prediction.

## Data sources
What data did you use? Provide details about your data. Include links to data if you are using open-access data.

## Stakeholders
Who cares? If you are successful, what difference will it make to them?

## Data quality check / cleaning / preparation 

In a tabular form, show the distribution of values of each variable used in the analysis - for both categorical and continuous variables. Distribution of a categorical variable must include the number of missing values, the number of unique values, the frequency of all its levels. If a categorical variable has too many levels, you may just include the counts of the top 3-5 levels. 
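A summary of this kind could be produced along the following lines. This is only a minimal sketch, assuming the data is held in a pandas DataFrame; the helper name `summarize_categorical` and the toy values are hypothetical and are not taken from the project data.

```python
import pandas as pd

def summarize_categorical(df, column, top_n=5):
    """Return the missing count, unique count, and top level frequencies."""
    counts = df[column].value_counts(dropna=True)  # sorted by frequency
    return {
        "missing": int(df[column].isna().sum()),
        "unique": int(df[column].nunique(dropna=True)),
        "top_levels": counts.head(top_n).to_dict(),
    }

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "smoker": ["no", "yes", "no", None, "no"],
    "bmi": [27.9, 33.8, None, 22.7, 28.9],
})

smoker_summary = summarize_categorical(toy, "smoker")

# For a continuous variable, describe() gives count, mean, std, and quartiles
bmi_summary = toy["bmi"].describe()
bmi_missing = int(toy["bmi"].isna().sum())

print(smoker_summary)
print(bmi_summary)
print("missing bmi values:", bmi_missing)
```

Limiting the frequency table to the top few levels keeps the output readable for variables with many levels.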

If the tables in this section take too much space, you may put them in the appendix, and just mention any useful insights you obtained from the data quality check that helped you develop the model or helped you realize the necessary data cleaning / preparation.

Were there any potentially incorrect values of variables that required cleaning? If yes, how did you clean them? 

Did you do any data wrangling or data preparation before the data was ready to use for model development? Did you create any new predictors from existing predictors? For example, if you have number of transactions and spend in a credit card dataset, you may create spend per transaction for predicting if a customer pays their credit card bill. Mention the steps at a broad level; you may put minor details in the appendix. Only mention the steps that ended up being useful towards developing your final model(s).

## Exploratory data analysis

Put the relevant EDA here (visualizations, tables, etc.) that helped you figure out useful predictors for developing the model(s). Only put the EDA that ended up being useful towards developing your final model(s). 

List the insights (as bullet points) you got from EDA that ended up being useful towards developing your final model. 

Again, if there are too many plots / tables, you may put them into appendix, and just mention the insights you got from them.

## Approach

What kind of a model (linear / logistic / other) did you use? What performance metric(s) did you optimize and why?

We used a linear regression model because the response, the insurance premium, is continuous and our goal was prediction. To measure the accuracy of our model we focused on RMSE and RSE and, to a lesser extent, R-squared. We tried to minimize both RMSE and RSE while keeping them close to each other: an RMSE on unseen data well above the RSE on the training data would indicate overfitting, while large values of both would indicate underfitting. 
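The comparison of the two error metrics can be sketched as follows. This is an illustration under stated assumptions rather than the project's actual code: it assumes that RSE is computed on the training data with `n - p - 1` degrees of freedom and RMSE on held-out data, the function names `rse` and `rmse` are hypothetical, and the data is synthetic.

```python
import numpy as np

def rse(y_train, fitted, n_predictors):
    """Residual standard error: sqrt(RSS / (n - p - 1)) on the training data."""
    resid = np.asarray(y_train, dtype=float) - np.asarray(fitted, dtype=float)
    dof = resid.size - n_predictors - 1
    return float(np.sqrt(np.sum(resid ** 2) / dof))

def rmse(y_true, predicted):
    """Root mean squared error, typically computed on held-out data."""
    err = np.asarray(y_true, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))

# Synthetic data (assumption): 3 predictors, noise standard deviation 1
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = 5.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=300)
X_train, X_test, y_train, y_test = X[:200], X[200:], y[:200], y[200:]

# Ordinary least squares fit with an intercept column
design_train = np.column_stack([np.ones(len(X_train)), X_train])
design_test = np.column_stack([np.ones(len(X_test)), X_test])
beta, *_ = np.linalg.lstsq(design_train, y_train, rcond=None)

train_rse = rse(y_train, design_train @ beta, n_predictors=3)
test_rmse = rmse(y_test, design_test @ beta)
print(f"training RSE: {train_rse:.3f}, test RMSE: {test_rmse:.3f}")
```

When the two values are of similar size, the model's error on unseen data is in line with its error on the training data.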

Is there anything unorthodox / new in your approach? 

What problems did you anticipate? What problems did you encounter? Did the very first model you tried work? 

Did your problem already have solution(s) (posted on Kaggle or elsewhere)? If yes, then how did you build upon those solutions, and what did you do differently? Is your model better than those solutions in terms of prediction / inference?

**Important: Mention any code repositories (with citations) or other sources that you used, and specifically what changes you made to them for your project.**

## Developing the model

Explain the steps taken to develop and improve the base model: informative visualizations / addressing modeling assumption violations / variable transformation / interactions / outlier treatment / influential points treatment / addressing over-fitting / addressing multicollinearity / variable selection (stepwise regression, lasso, ridge regression). 

Did you succeed in achieving your goal, or did you fail? Why?

**Put the final model equation**.

**Important: This section should be rigorous and thorough. Present detailed information about the decisions you made, why you made them, and any evidence/experimentation to back them up.**

#### Outliers and Influential Points

A plot of the studentized residuals against the fitted values of the model revealed 51 outliers (out of 1070 training observations). Although the outliers make up less than 5% of the training observations, it was still necessary to investigate why they existed and whether they affected the quality of the model. The high leverage test (using a leverage cutoff of 4 times the average leverage) revealed 16 high leverage points in the training dataset, but only 1 of them was a highly influential point (both high leverage and an outlier). However, removing the influential point increased both the RSE and RMSE and decreased the R-squared of the model, so the overall quality of the model decreased. This remained the case when we repeated the analysis with several different values in `np.random.seed()`.

To better understand the causes of the outliers, we investigated the trends between the predictors and the outliers. While the other predictors showed no unusual pattern among the outliers, 82.35% of the outliers were non-smokers. We therefore concluded that there are likely predictors not represented in our dataset that are causing these deviations. We also chose to keep all outliers and the influential point in the datasets for developing the model, because we believe that they remain significant in creating an accurate model, and additional information (such as more predictors) would be needed to justify removing them. 
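The checks described above can be sketched in Python as follows. This is a minimal illustration and not the project's actual code: the cutoff of 3 on the absolute studentized residual is an assumption (the report does not state the threshold used), the function name `flag_outliers_and_leverage` is hypothetical, and the data is synthetic with one planted unusual observation. Only the leverage cutoff of 4 times the average leverage is taken from the text.

```python
import numpy as np

def flag_outliers_and_leverage(X, y, resid_cutoff=3.0, leverage_multiplier=4.0):
    """Flag outliers, high leverage points, and their overlap for an OLS fit."""
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])  # add an intercept column
    p = design.shape[1]

    # Leverage is the diagonal of the hat matrix H = Q Q^T (design = QR),
    # i.e. the squared row norms of Q.
    q, _ = np.linalg.qr(design)
    leverage = np.sum(q ** 2, axis=1)

    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = resid @ resid / (n - p)

    # Internally, then externally (leave-one-out) studentized residuals
    internal = resid / np.sqrt(sigma2 * (1.0 - leverage))
    external = internal * np.sqrt((n - p - 1) / (n - p - internal ** 2))

    outlier = np.abs(external) > resid_cutoff  # assumed cutoff
    high_leverage = leverage > leverage_multiplier * leverage.mean()
    return outlier, high_leverage, outlier & high_leverage

# Synthetic data (assumption) with one planted unusual observation at index 0
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)
X[0] = [8.0, -8.0]  # extreme predictor values give high leverage
y[0] = 100.0        # response far from the trend gives a large residual

outlier, high_leverage, influential = flag_outliers_and_leverage(X, y)
print("outliers:", int(outlier.sum()))
print("high leverage points:", int(high_leverage.sum()))
print("influential points:", int(influential.sum()))
```

Refitting the model without the flagged influential points and comparing RSE, RMSE, and R-squared before and after then shows whether removing them actually helps.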

## Limitations of the model with regard to inference / prediction

The information needed for the predictors in our model is easy for all stakeholders to acquire. The predictors all relate to an individual’s health and general demographics (age, smoking history, BMI, region of residence, number of children, sex). Stakeholders who are consumers of health insurance plans can get results from our model immediately, since they already have the predictor information on hand. The same holds for healthcare workers (such as those in medical offices or hospitals): they already have the patient’s information on hand, so they can also get results immediately. Stakeholders who are healthcare insurance companies will need to acquire information from potential customers before using our model to predict what prices they should charge. Collecting this information is simple, since they can request it directly from potential customers, after which they can use the model and get results immediately. 

One point of concern is that our model lacks some information that would likely be significant in an insurance premium prediction model. This includes an individual’s income level, occupation, and whether they would like a more basic or a more advanced health insurance plan. While our stakeholders all have this information on hand, our training dataset does not. We believe a more accurate model would include these predictors. 

Another point of concern is the effect of pricing and the economy. Our model predicts the price of insurance premiums, but this price also depends heavily on the general state of the economy and the healthcare industry. For instance, high rates of inflation or a shortage of healthcare supplies will significantly impact pricing across multiple industries. In real life, healthcare insurance companies would adjust their prices according to the rate of inflation in the economy, but our model does not do so. Thus, our model would need to be recalibrated with more recent training data whenever there are significant changes in the economy. 


## Other sections *(optional)*

You are welcome to introduce additional sections or subsections, if required, to address any specific aspects of your project in detail. For example, you may briefly discuss potential future work that the research community could focus on to make further progress in the direction of your project's topic.

## Conclusions and Recommendations to stakeholder(s)

Overall, our model tells stakeholders that individuals who are older, have a smoking history, have a higher BMI, have more children, and are located in the northeast will have higher health insurance costs. As a numerical example: for every one-year increase in age, an individual will be charged about $262 more for health insurance premiums, holding the other predictors fixed. 

Our model allows the stakeholders to get a relatively accurate estimate of health insurance premium costs. Stakeholders who are health insurance customers will be able to use our model to predict how much they will need to spend on health insurance. This will allow for easier and more accurate budgeting of their personal costs in everyday life. Health insurance customers who are older, have a smoking history, have more children, and have a higher BMI should allocate more money towards health insurance. Stakeholders who are health insurance companies can use our model to predict how much to charge an individual for health insurance premiums. This can help the companies protect their profits, as the model can reduce errors made in pricing insurance plans. Finally, stakeholders who are healthcare providers can use our model to predict how much money they will earn from patients, since healthcare insurance premiums are indirectly linked to how much the healthcare provider will earn. These healthcare providers can therefore use our model when keeping track of their finances, as it will also give them more accurate profit forecasts. 

The RMSE of our model is $4,100, the RSE is $4,541, and the R-squared is 0.87. Since the RMSE is smaller than the RSE, we see no evidence of overfitting (the model performs at least as well on unseen data as on the training data). As mentioned above, we believe that the model could be improved (lower RMSE and RSE) with additional predictors and information. Health insurance pricing typically also takes an individual’s income level and occupation into account. Our stakeholders should be aware of this and consider combining our model with such additional information when pricing health insurance premiums. As also mentioned above, the model should be updated when there are significant changes in the economy (inflation and overall price changes). There is no formal timeline for this procedure (as the state of the economy does not follow a constant pattern), but the model should be re-trained with more recent data whenever appropriate. 


## GitHub and individual contribution {-}

**https://github.com/nataliekhao/STAT303-2-Project** for the project repository.

Add details of each team member's contribution in the table below.

<html>
<style>
table, td, th {
  border: 1px solid black;
}

table {
  border-collapse: collapse;
  width: 100%;
}

th {
  text-align: left;
}
    

</style>
<body>

<h2>Individual contribution</h2>

<table style="width:100%">
     <colgroup>
       <col span="1" style="width: 15%;">
       <col span="1" style="width: 20%;">
       <col span="1" style="width: 50%;">
       <col span="1" style="width: 15%;"> 
    </colgroup>
  <tr>
    <th>Team member</th>
    <th>Contributed aspects</th>
    <th>Details</th>
    <th>Number of GitHub commits</th>
  </tr>
  <tr>
    <td>Natalie Khaoroptham</td>
    <td>Data cleaning and EDA</td>
    <td>Cleaned data to impute missing values and developed visualizations to identify appropriate variable transformations.</td>
    <td>100</td>
  </tr>
  <tr>
    <td>Isha Sharma</td>
    <td>Assumptions and interactions</td>
    <td>Checked and addressed modeling assumptions and identified relevant variable interactions.</td>
    <td>120</td>
  </tr>
    <tr>
    <td>Raphael Tinio</td>
    <td>Outlier and influential points treatment</td>
    <td>Identified outliers/influential points and analyzed their effect on the model.</td>
    <td>130</td>    
  </tr>
    <tr>
    <td>Philia Wang</td>
    <td>Variable selection and addressing overfitting</td>
    <td>Performed variable selection on an exhaustive set of predictors to address multicollinearity and overfitting.</td>
    <td>150</td>    
  </tr>
</table>

List the **challenges** you faced when collaborating with the team on GitHub. Are you comfortable using GitHub? 
Do you feel GitHub made collaboration easier? If not, then why? *(Individual team members can put their opinion separately, if different from the rest of the team)*

## References {-}

List and number all bibliographical references. When referenced in the text, enclose the citation number in square brackets, for example [1].



## Appendix {-}

You may put additional material here as an Appendix. You may refer to the Appendix in the main report to support your arguments. However, the appendix section is unlikely to be checked while grading, unless the grader deems it necessary.