In [None]:
import pandas as pd
import seaborn as sns
import os
import sys
module_path = os.path.abspath(os.path.join('.'))
if module_path not in sys.path:
    sys.path.append(module_path)
import project_functions
url='https://raw.githubusercontent.com/data301-2020-winter2/course-project-group_1017/main/Data/Raw/medical_expenses.csv'

Our original data set is given below

In [None]:
dfo=pd.read_csv(url)
dfo

This data set describes expected medical costs in America as a function of other variables; it comes from Machine Learning with R by Brett Lantz. The dataset is already fairly clean, but I will do some wrangling so that it is better suited to analysis. From this dataset I would like to answer how unhealthy lifestyles, such as an under- or over-average bmi and smoking, affect the medical charges of an individual, and to what extent these factors play a role in the expected charges.

**Data Wrangling**
---

---

First I would like to add some columns to this dataset and rename some of the variables so that the names better describe what they represent.

In [None]:
df=project_functions.load_and_process(url)
df

Here I dropped the children and region columns, since they won't be used in further analysis and are therefore redundant, and I rounded the bmi and charges values to a more commonly used precision.
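As a rough illustration of this wrangling step, here is a minimal sketch using pandas method chaining. It is not the actual `project_functions.load_and_process`: the function name `load_and_process_sketch`, the raw column names, the rounding precision and the sample rows are all assumptions made for the example, and it takes a data frame rather than a URL so that it runs offline.

```python
import pandas as pd

# Made-up rows that mimic the assumed raw column layout.
raw = pd.DataFrame({
    "age": [19, 45, 33],
    "sex": ["female", "male", "male"],
    "bmi": [27.94, 22.137, 41.268],
    "children": [0, 2, 1],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "northeast", "southeast"],
    "charges": [16884.924, 7731.4271, 4687.797],
})


def load_and_process_sketch(frame):
    """Hypothetical stand-in for project_functions.load_and_process."""
    return (
        frame
        .drop(columns=["children", "region"])          # unused in later analysis
        .assign(
            bmi=lambda d: d["bmi"].round(1),           # assumed precision
            charges=lambda d: d["charges"].round(2),   # assumed precision
        )
        .reset_index(drop=True)
    )


processed = load_and_process_sketch(raw)
print(processed)
```

Chaining keeps each wrangling step on its own line, which makes it easy to add or remove a step later.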

In [None]:
project_functions.mean(df)

Taking the mean of the charges column returns 13270.42, which we use as the threshold for excessive charges: a patient has excessive charges if they pay more than this average. Next I want to separate the data into healthy and unhealthy populations to see how the frequency of excessive charges changes across the population.
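The threshold idea can be sketched as follows. The column name `excess_charges` and the numbers are made up for illustration; the real flag column in the project may be named and built differently.

```python
import pandas as pd

# Made-up charges; the real data has a mean of 13270.42.
charges = pd.DataFrame({"charges": [2000.0, 9000.0, 14000.0, 35000.0]})

# The mean of the column acts as the threshold.
threshold = charges["charges"].mean()

# Hypothetical flag column: True when a patient pays more than the mean.
charges["excess_charges"] = charges["charges"] > threshold

print(threshold)
print(charges)
```

Note that a few very large charges pull the mean upwards, so in this toy sample only one of the four patients ends up above the threshold.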

In [None]:
dfh = project_functions.Health(url)
dfun = project_functions.unHealth(url)

In [None]:
dfh

Here is the data set for healthy people. The requirements were to be in the healthy bmi range of 18.5-24.9 and to not be a smoker.

In [None]:
dfun

Here is the data set for unhealthy people. The requirements are the inverse of those for the healthy data set: an individual has to be outside the healthy bmi range or be classified as a smoker. We can first look at some regression plots of age vs charges to see how having an unhealthy lifestyle affects the charges.
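The two selection rules can be sketched with boolean masks. This is only an illustration of the logic, not the code inside `project_functions.Health` or `project_functions.unHealth`; the column names, the `"yes"`/`"no"` encoding of smoker and the sample rows are assumptions.

```python
import pandas as pd

# Made-up individuals.
people = pd.DataFrame({
    "bmi": [22.0, 22.0, 30.5, 17.0, 24.9],
    "smoker": ["no", "yes", "no", "no", "no"],
})

in_range = people["bmi"].between(18.5, 24.9)   # inclusive at both ends
non_smoker = people["smoker"].eq("no")

healthy = people[in_range & non_smoker]        # healthy bmi AND non-smoker
unhealthy = people[~(in_range & non_smoker)]   # outside range OR smoker

print(healthy)
print(unhealthy)
```

Negating the healthy mask guarantees that the two groups are complementary, so every individual lands in exactly one of them.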

**Regression Plots**
---

---

In [None]:
project_functions.plotAvC(df)

Regression plot for whole population

In [None]:
project_functions.plotAvC(dfh)

Regression plot for healthy individuals

In [None]:
project_functions.plotAvC(dfun)

Regression plot for unhealthy individuals

From a quick look at the regression lines, we can clearly tell that having an unhealthy habit leads, on average, to a higher charge at each age, with only the outliers among the healthy individuals coming close to the average of the unhealthy ones. This is fairly ordinary, as unhealthy habits require more health check-ups and so on.

However, on closer inspection we see that the unhealthy group has a lot of spread, and there appear to be three separate populations in this dataset. This makes sense, as the requirement for this dataset was to be a smoker OR outside the healthy bmi range, which results in three different datasets; more on that later.
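To compare regression lines numerically rather than by eye, one option is to fit the lines directly. The sketch below assumes an ordinary least-squares straight line of charges on age, which is what seaborn's `regplot` draws by default; whether `plotAvC` uses exactly that is an assumption, and the data are made up.

```python
import numpy as np

# Made-up, exactly linear data for two groups.
age = np.array([20, 30, 40, 50, 60], dtype=float)
healthy_charges = 100.0 * age + 1000.0
unhealthy_charges = 100.0 * age + 9000.0


def fit_line(x, y):
    """Least-squares straight line; returns (slope, intercept)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept


h_slope, h_intercept = fit_line(age, healthy_charges)
u_slope, u_intercept = fit_line(age, unhealthy_charges)

# Difference between the two fitted lines at age 40.
gap_at_40 = (u_slope * 40 + u_intercept) - (h_slope * 40 + h_intercept)
print(round(gap_at_40, 2))
```

A gap that stays positive across the whole age range corresponds to one regression line sitting above the other in the plots.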

For now we can look at how much more likely an individual is to have excess charges if they have an unhealthy trait, using frequency bar plots.

**Bar plots**
---

---

In [None]:
project_functions.BrPltECD(df)

Bar plot for whole population

In [None]:
project_functions.BrPltECD(dfh)

Frequency of individuals receiving excess charges in the healthy population

In [None]:
project_functions.BrPltECD(dfun)

Frequency of individuals receiving excess charges in the unhealthy population

This further reinforces that an unhealthy lifestyle choice leads to a higher chance of an individual receiving a higher-than-average medical bill. We saw that there were outliers in the regression plots, and we also couldn't see how widely the data was spread for the healthy and unhealthy groups. To better visualise this we can use box plots.
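The frequencies shown in the bar plots can also be computed as plain numbers. In this sketch the `group` and `excess_charges` columns and the values are made up; the mean of a boolean column gives the proportion of `True` entries.

```python
import pandas as pd

# Made-up flags: four healthy and four unhealthy individuals.
sample = pd.DataFrame({
    "group": ["healthy"] * 4 + ["unhealthy"] * 4,
    "excess_charges": [False, False, False, True, True, True, False, True],
})

# Proportion of individuals with excess charges in each group.
rates = sample.groupby("group")["excess_charges"].mean()
print(rates)
```

Comparing proportions rather than raw counts matters here because the groups differ in size.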

**Box plots**
---

---

In [None]:
project_functions.BoxPlt(df)

This further shows the spread of the unhealthy charges, which is due to there being multiple distributions combined in that group. The comparison of healthy vs unhealthy also shows how much less people with a healthy lifestyle pay in charges. We can draw several conclusions just from this graph: 75% of the healthy population pay less than 10,000, whilst only 50% of the unhealthy population pay below this figure.
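The statements read off the box plot correspond to quartiles, which can be checked numerically. The following sketch uses made-up charges and hypothetical column names; it computes the quartiles per group and the share of each group paying less than 10,000.

```python
import pandas as pd

# Made-up charges for two groups of five people.
charges = pd.DataFrame({
    "group": ["healthy"] * 5 + ["unhealthy"] * 5,
    "charges": [2000, 4000, 6000, 8000, 12000,
                3000, 7000, 10000, 20000, 40000],
})

# Quartiles per group: the box in a box plot spans 0.25 to 0.75.
quartiles = (
    charges.groupby("group")["charges"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

# Share of each group paying strictly less than 10,000.
share_below = (
    charges.assign(below=charges["charges"] < 10000)
    .groupby("group")["below"]
    .mean()
)

print(quartiles)
print(share_below)
```

If the 0.75 quantile of a group is below 10,000, then at least 75% of that group pays less than 10,000, which is the kind of statement made above.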

**Different Unhealthy Categories**
---

---

We mentioned earlier that the unhealthy regression plot seemed to contain three different distributions. Now let's look at these in more depth and at how they affect the charges. The three categories are under bmi, over bmi and smokers. First let's create data frames for each of these categories, with each person having only one unhealthy trait.
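One way to build three categories with a single unhealthy trait per person is sketched below. It is an illustration only, not the code in `project_functions.smoker`, `overBmi` or `underBmi`; the column names, the smoker encoding and the rows are assumptions.

```python
import pandas as pd

# Made-up individuals.
people = pd.DataFrame({
    "bmi": [22.0, 31.0, 17.5, 30.0, 23.0],
    "smoker": ["yes", "no", "no", "yes", "no"],
})

smokes = people["smoker"].eq("yes")
under = people["bmi"] < 18.5
over = people["bmi"] > 24.9
in_range = ~under & ~over

smoker_only = people[smokes & in_range]     # smokes, healthy bmi
over_bmi_only = people[over & ~smokes]      # over bmi, non-smoker
under_bmi_only = people[under & ~smokes]    # under bmi, non-smoker

print(smoker_only, over_bmi_only, under_bmi_only, sep="\n\n")
```

In this toy sample the fourth person smokes and is over the bmi range, so they fall into none of the three single-trait groups.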

In [None]:
dfs=project_functions.smoker(url)
dfs

Data frame for smokers only

In [None]:
dfob=project_functions.overBmi(url)
dfob

Data frame for over average bmi

In [None]:
dfub=project_functions.underBmi(url)
dfub

Dataframe for under average bmi

**Regression Plots**
---

In [None]:
project_functions.plotAvC(dfs)

Regression plot for smokers only

We see here that smokers pay at least around 14,000, which is larger than the average charge for all patients. This suggests that the leading factor in large charges is smoking.

In [None]:
project_functions.plotAvC(dfob)

Regression plot for overweight people

Here is where we see the majority of our population, and this looks very similar to the regression plot of healthy individuals, with a similar frequency of outliers above the regression line. This points towards eliminating over-average bmi as a leading factor in above-average charges.

In [None]:
project_functions.plotAvC(dfub)

Regression plot for underweight people only

This regression plot actually shows that an under-average bmi leads towards a much lower charge at each age, as the lowest charge is around 2,000 whilst the maximum is only around 13,000. However, this is a very small population compared to the other cases, only 14 people, so we cannot say this for certain. Still, this evidence combined with the over-average bmi result helps eliminate bmi as a leading cause of excess medical charges.
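The caution about a group of only 14 people can be made concrete with the standard error of the mean, $s/\sqrt{n}$, which shrinks as the sample grows. The sketch below uses made-up charges that simply alternate between 2,000 and 13,000; it is not computed from the real data.

```python
import numpy as np


def standard_error(values):
    """Standard error of the mean: sample std (ddof=1) / sqrt(n)."""
    values = np.asarray(values, dtype=float)
    return values.std(ddof=1) / np.sqrt(len(values))


# Made-up charges alternating between 2,000 and 13,000.
small_sample = np.array([2000.0, 13000.0] * 7)     # n = 14
large_sample = np.array([2000.0, 13000.0] * 350)   # n = 700

se_small = standard_error(small_sample)
se_large = standard_error(large_sample)
print(round(se_small, 1), round(se_large, 1))
```

With the same spread of values, the mean of 14 observations is far less certain than the mean of several hundred.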

These regression plots mostly show that smokers have a large variation in charges at each age, that bmi does not have much of an effect on the charges at each age, and that a lower bmi actually looks like it leads to a lower charge in each age group.

**Bar Plots**
---

---

In [None]:
project_functions.BrPltECD(dfs)

Bar plot for smokers only

This further shows that smoking is a leading factor in excess medical charges: from a respectable sample size, we've shown that every smoker in this group has excess charges.

In [None]:
project_functions.BrPltECD(dfob)

Bar plot for overweight people only

Here we see that overweight people are more likely to have excess charges than healthy people: the frequency of excess charges is around 0.15 for overweight people, whereas it is around 0.1 for healthy people. This shows that having an over-average bmi leads towards a higher chance of being charged more than the average cost. This is further backed up by the fact that this population is quite large, so it can be used to draw a better conclusion. There were some outliers, but we can compare those with box plots later.

In [None]:
project_functions.BrPltECD(dfub)

Bar plot for underweight people only

Here we see something interesting: underweight people are only ever charged below the average. This could be down to the small sample size of this population. Because of this I won't conclude that a lower bmi leads to lower medical fees, but it can be used to rule out a low bmi as a leading factor in higher fees.

**Box Plots**
---

---

In [None]:
project_functions.BoxPltob(df)

Box plot of not overweight vs overweight

Here we can see that the over-bmi population has a similar spread compared to the healthy-bmi population, but shifted upwards by a slight amount. The over-bmi group also has more outliers, but this is expected given its larger population size. This larger population could be due to the growing obesity crisis in America, which makes people more likely to be overweight. Because being overweight is so common, this group is likely to shift the mean of the whole dataset, which could be the reason its frequency of excess charges wasn't much higher.

So the higher frequency of excess charges and the larger number of outliers point towards the over-bmi population having a larger medical charge on average. However, the outliers skew the mean, so it isn't a true representation of the typical charge, and this would need more investigation.
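To see how outliers skew the mean, the sketch below flags outliers with the usual box plot rule (points more than 1.5 times the interquartile range beyond the quartiles, which is the default whisker length in matplotlib and seaborn) and compares the mean with the median. The helper `iqr_outliers` and the charges are made up for illustration.

```python
import pandas as pd


def iqr_outliers(values):
    """Return the entries lying beyond 1.5 * IQR from the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]


# Made-up charges with one very large value.
charges = pd.Series(
    [3000, 4000, 5000, 6000, 7000, 8000, 9000, 50000], dtype=float
)

outliers = iqr_outliers(charges)
print(outliers.tolist())
print(charges.mean(), charges.median())
```

A single large charge pulls the mean well above the median, which is why the median is often the safer summary when a group has many outliers.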

In [None]:
project_functions.BoxPltub(df)

Box plot of not underweight vs underweight

Again we see that the average cost for the under-bmi population is lower, and the spread of the data seems to be the same as in the healthy case. There are no outliers in this under-bmi population, so the mean is more representative of the population.

In [None]:
project_functions.BoxPlts(df)

Box plot of non smokers vs smokers

The spread of the smoker population is clearly much larger, and sits much higher, than that of the healthy population. We also see no outliers in this case, so this spread is a good representation of the whole smoker population. Looking at this and the previous examples, it is clear that smoking is the dominant factor in higher medical fees.

From the previous plots we saw that bmi doesn't have much influence on charges. In fact a lower bmi looked to lead to lower fees, but the population was too small to draw concrete conclusions from. In the over-bmi case we can look at higher bmi values, closer to obesity, to examine cases that are more extreme than the average population. We also saw that smoking was the dominant factor in higher fees, which we will investigate some more.

**Smokers**
---

---

We saw from the previous analysis that isolating the individuals who only smoked showed a trend of excessive medical charges. Now I want to look at how medical charges differ among people who smoke and have different bmi values.

In [None]:
dfas=project_functions.allsmoker(url)
dfas

**Regression Plots**
---

---

In [None]:
project_functions.plotAvC(dfas)

**Bar Plots**
---

---

In [None]:
project_functions.BrPltECD(dfas)

**Box Plot**
---

In [None]:
project_functions.BoxPlts(dfas)

Looking at these plots, it is clear that even when we extend the smokers case to all the people who smoke, excess charges remain common across the population. Now that we have a larger population we can draw a more concrete conclusion. The regression plot shows that most of the above-average charges are associated with smokers. This is further backed up by the bar plot, which shows that almost all smokers have above-average charges. The final nail for the argument that smoking is the leading factor in higher-than-average charges is the box plot, which shows no outliers and a smallest fee only just below the excessive-charges threshold.

**Obese**
---

---

Now I want to look at how extreme obesity affects the charges. This is classified as a bmi of 40 or higher.
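The cut-off itself is a simple filter, sketched below on made-up rows with assumed column names. Whether `project_functions.obese` also filters on smoking status is not stated here, so this sketch applies only the bmi condition.

```python
import pandas as pd

# Made-up individuals.
people = pd.DataFrame({
    "bmi": [39.9, 40.0, 45.2, 23.1],
    "charges": [6000.0, 7000.0, 9000.0, 3000.0],
})

# Extreme obesity as defined above: bmi of 40 or higher (inclusive).
extremely_obese = people[people["bmi"] >= 40]
print(extremely_obese)
```

Using `>=` rather than `>` keeps individuals with a bmi of exactly 40 in the group, matching the "40 or higher" definition.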

In [None]:
dfobs=project_functions.obese(url)
dfobs

**Regression Plots**
---

---

In [None]:
project_functions.plotAvC(dfobs)

**Bar plots**
---

---

In [None]:
project_functions.BrPltECD(dfobs)

**Box plots**
---

In [None]:
project_functions.BoxPltob(dfobs)

This is quite surprising: the extreme case of obesity shows similar distributions to the healthy population, with the frequency plot and box plot being almost the same and the regression plot showing nothing out of the ordinary compared with our other plots. This shows that even extreme obesity doesn't affect medical charges to a large degree, which is surprising, as I thought it would have had a substantial impact on charges.

Taking all of this into account, we can say with confidence that the leading factor in excessive charges is smoking and not the bmi of an individual.