# U.S. Medical Insurance Costs

## Tasks

- [x] Import a dataset into your program
- Analyze a dataset by building out functions or class methods
- Use libraries to assist in your analysis
- Optional: Document and organize your findings
- Optional: Make predictions about a dataset’s features based on your findings

## Scoping Your Project

### Preparation

- [x] check the data
    - check for missing data
    - clean the data
        - convert _nominal_ and _ordinal_ data from `str` to `int` (e.g. _binary_ (_nominal_) `sex` data to `0` or `1`)[^1]

[^1]: The pandas `category` data type: when working with categorical variables in Python, especially ordinal ones, it is often advantageous to use the pandas-specific `category` dtype, which lets you store category names together with their associated values and ranking.
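
As a minimal sketch of what the footnote describes, the snippet below builds one unordered (nominal) and one ordered (ordinal) categorical column. The sample values and the BMI class labels are assumptions made for illustration; they are not taken from `insurance.csv`.

```python
import pandas as pd

# Hypothetical sample values; the real column would come from insurance.csv.
smoker = pd.Series(['yes', 'no', 'no', 'yes'])

# Nominal data: categories are listed explicitly so that the codes are predictable.
smoker_cat = smoker.astype(pd.CategoricalDtype(categories=['no', 'yes']))
smoker_codes = smoker_cat.cat.codes.tolist()

# Ordinal data: ordered=True gives the categories a ranking that comparisons respect.
# These BMI classes are illustrative only and do not exist in the dataset.
bmi_class = pd.Series(['normal', 'obese', 'underweight', 'overweight']).astype(
    pd.CategoricalDtype(
        categories=['underweight', 'normal', 'overweight', 'obese'],
        ordered=True,
    )
)
bmi_codes = bmi_class.cat.codes.tolist()
above_normal = (bmi_class > 'normal').tolist()

print(smoker_codes)
print(bmi_codes)
print(above_normal)
```

The integer codes follow the position of each label in `categories`, so the mapping stays explicit while the readable labels are kept in the column.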

### Exploration

- [ ] analyse the data by means of descriptive statistics
    - e.g. mean, median, min, max, box plot

### Exploitation

- [ ] check for relationships by means of exploratory statistics
    - check for linear correlations (R^2)
    - graph and chart data
    - make a (machine learning) model 






#### Import a dataset into your program
1. check whether headers are available

In [179]:
import csv

with open('insurance.csv', mode='r', newline='') as insurance_csv:
    # Sniffer.has_header() applies a heuristic to a text sample;
    # here the whole file content serves as the sample
    sample = insurance_csv.read()
    has_header = csv.Sniffer().has_header(sample)
    print(has_header)

True
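
For larger files it may be preferable to sniff only a bounded sample and then rewind the file. The following is a sketch under that assumption: the in-memory `sample_csv` string stands in for `insurance.csv`, and the 2048-character sample size is an arbitrary choice.

```python
import csv
import io

# Hypothetical in-memory stand-in for insurance.csv (first rows only).
sample_csv = (
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
    "28,male,33,3,no,southeast,4449.462\n"
)

with io.StringIO(sample_csv, newline='') as csv_file:
    sample = csv_file.read(2048)   # bounded sample instead of the whole file
    csv_file.seek(0)               # rewind so the reader starts at the first line
    sniffer = csv.Sniffer()
    dialect = sniffer.sniff(sample)            # guesses delimiter and quoting
    has_header = sniffer.has_header(sample)    # heuristic, not a guarantee
    fieldnames = next(csv.reader(csv_file, dialect)) if has_header else []

print(has_header, fieldnames)
```

Because `has_header` is a heuristic, its result is best treated as a hint and checked against the first printed row.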


2. read the csv file

In [180]:
with open('insurance.csv', mode='r', newline='') as insurance_csv:
    # restval=None is the default: fields missing from a short row are filled with None
    reader = csv.DictReader(insurance_csv, restval=None)
    insurance_data = []
    for row in reader:
        insurance_data.append(row)
    print('first rows of source data: ', insurance_data[:1])

# count the rows in which at least one field is missing (None)
count_missing_data = 0
for row in insurance_data:
    if None in row.values():
        count_missing_data += 1

print('A total of {} rows have missing values in the source data'.format(count_missing_data))


first rows of source data:  [{'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'}]
A total of 0 rows have missing values in the source data
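
Note that `csv.DictReader` only yields `None` for fields that are absent from a short row; a field that is present but empty arrives as `''`. The sketch below counts both cases. The sample data and the helper name `is_missing` are assumptions for illustration.

```python
import csv
import io

# Hypothetical sample: the second data row has an empty 'bmi' field,
# the third data row is cut short.
sample_csv = (
    "age,sex,bmi\n"
    "19,female,27.9\n"
    "18,male,\n"
    "28,male\n"
)

rows = list(csv.DictReader(io.StringIO(sample_csv, newline='')))

def is_missing(value) -> bool:
    """Treat None (short row) and blank strings (empty field) as missing."""
    return value is None or (isinstance(value, str) and value.strip() == '')

rows_with_gaps = [row for row in rows if any(is_missing(v) for v in row.values())]
print(len(rows_with_gaps))
```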


3. clean the data
- convert _nominal_ and _ordinal_ data from `str` to `int` (e.g. _binary_ (_nominal_) `sex` data to `0` or `1`)[^1]

In [181]:
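# each mapping names its target column under 'category';
# the remaining keys map the string labels to integer codes
sex_mapping = {
    'category': 'sex',
    'male': 0,
    'female': 1
}

smoker_mapping = {
    'category': 'smoker',
    'yes': 1,
    'no': 0
}

def convert_nominal_to_int(in_data: list, mapping: dict) -> None:
    """Replace the string labels of one column with integer codes, in place."""
    key = mapping['category']
    for datum in in_data:
        datum[key] = mapping[datum[key]]

# note: re-running this cell raises a KeyError, because the labels have already been replaced
convert_nominal_to_int(insurance_data, sex_mapping)
convert_nominal_to_int(insurance_data, smoker_mapping)

print(insurance_data[0:10])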

[{'age': '19', 'sex': 1, 'bmi': '27.9', 'children': '0', 'smoker': 1, 'region': 'southwest', 'charges': '16884.924'}, {'age': '18', 'sex': 0, 'bmi': '33.77', 'children': '1', 'smoker': 0, 'region': 'southeast', 'charges': '1725.5523'}, {'age': '28', 'sex': 0, 'bmi': '33', 'children': '3', 'smoker': 0, 'region': 'southeast', 'charges': '4449.462'}, {'age': '33', 'sex': 0, 'bmi': '22.705', 'children': '0', 'smoker': 0, 'region': 'northwest', 'charges': '21984.47061'}, {'age': '32', 'sex': 0, 'bmi': '28.88', 'children': '0', 'smoker': 0, 'region': 'northwest', 'charges': '3866.8552'}, {'age': '31', 'sex': 1, 'bmi': '25.74', 'children': '0', 'smoker': 0, 'region': 'southeast', 'charges': '3756.6216'}, {'age': '46', 'sex': 1, 'bmi': '33.44', 'children': '1', 'smoker': 0, 'region': 'southeast', 'charges': '8240.5896'}, {'age': '37', 'sex': 1, 'bmi': '27.74', 'children': '3', 'smoker': 0, 'region': 'northwest', 'charges': '7281.5056'}, {'age': '37', 'sex': 0, 'bmi': '29.83', 'children': '2', 
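
One possible variant, sketched here as an assumption rather than as the notebook's method, keeps the column name separate from the label codes and returns new rows instead of mutating the input. Values that are already encoded pass through, so the step can be repeated safely. The names `encode_nominal` and `SEX_CODES` are hypothetical.

```python
SEX_CODES = {'male': 0, 'female': 1}   # same codes as sex_mapping, without the 'category' key

def encode_nominal(rows: list, column: str, codes: dict) -> list:
    """Return copies of rows with one column's labels replaced by integer codes."""
    encoded = []
    for row in rows:
        value = row[column]
        if value in codes:
            new_value = codes[value]          # known label -> code
        elif value in codes.values():
            new_value = value                 # already encoded, keep as is
        else:
            raise ValueError('unexpected value {!r} in column {!r}'.format(value, column))
        encoded.append({**row, column: new_value})
    return encoded

# Hypothetical sample rows in the shape produced by csv.DictReader.
sample_rows = [{'age': '19', 'sex': 'female'}, {'age': '18', 'sex': 'male'}]
encoded_once = encode_nominal(sample_rows, 'sex', SEX_CODES)
encoded_twice = encode_nominal(encoded_once, 'sex', SEX_CODES)
print(encoded_once)
```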

In [182]:
from typing import Callable

def change_type(in_data: list, category: str, out_type: Callable) -> None:
    """Convert the values of one column with out_type (e.g. float), in place."""
    for datum in in_data:
        datum[category] = out_type(datum[category])

change_type(insurance_data, 'age', float) # str() didn't work; I don't know why
change_type(insurance_data, 'bmi', float)
change_type(insurance_data, 'charges', float)

print(insurance_data[0:10])

[{'age': 19.0, 'sex': 1, 'bmi': 27.9, 'children': '0', 'smoker': 1, 'region': 'southwest', 'charges': 16884.924}, {'age': 18.0, 'sex': 0, 'bmi': 33.77, 'children': '1', 'smoker': 0, 'region': 'southeast', 'charges': 1725.5523}, {'age': 28.0, 'sex': 0, 'bmi': 33.0, 'children': '3', 'smoker': 0, 'region': 'southeast', 'charges': 4449.462}, {'age': 33.0, 'sex': 0, 'bmi': 22.705, 'children': '0', 'smoker': 0, 'region': 'northwest', 'charges': 21984.47061}, {'age': 32.0, 'sex': 0, 'bmi': 28.88, 'children': '0', 'smoker': 0, 'region': 'northwest', 'charges': 3866.8552}, {'age': 31.0, 'sex': 1, 'bmi': 25.74, 'children': '0', 'smoker': 0, 'region': 'southeast', 'charges': 3756.6216}, {'age': 46.0, 'sex': 1, 'bmi': 33.44, 'children': '1', 'smoker': 0, 'region': 'southeast', 'charges': 8240.5896}, {'age': 37.0, 'sex': 1, 'bmi': 27.74, 'children': '3', 'smoker': 0, 'region': 'northwest', 'charges': 7281.5056}, {'age': 37.0, 'sex': 0, 'bmi': 29.83, 'children': '2', 'smoker': 0, 'region': 'northeas
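
The output above shows that `children` is still a string. A sketch of one way to convert several columns in a single pass follows; the table `COLUMN_TYPES` and the function `convert_columns` are hypothetical names, and the choice of `int` for `age` and `children` is an assumption based on the values shown.

```python
# Hypothetical converter table: column name -> target type.
COLUMN_TYPES = {'age': int, 'bmi': float, 'children': int, 'charges': float}

def convert_columns(rows: list, column_types: dict) -> list:
    """Return copies of rows with the listed columns converted to their target types."""
    converted = []
    for row in rows:
        new_row = dict(row)                    # leave the input rows untouched
        for column, to_type in column_types.items():
            new_row[column] = to_type(row[column])
        converted.append(new_row)
    return converted

# Sample row in the shape shown after the nominal conversion step.
sample_rows = [{'age': '19', 'sex': 1, 'bmi': '27.9', 'children': '0',
                'smoker': 1, 'region': 'southwest', 'charges': '16884.924'}]
converted_rows = convert_columns(sample_rows, COLUMN_TYPES)
print(converted_rows[0])
```

Keep in mind that `int()` accepts a string such as `'19'` but raises a `ValueError` for `'19.0'`, so `float` is the safer choice for columns that may contain decimals.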

#### Analyze a dataset by building out functions or class methods

* Minimum
* Maximum

In [205]:
def find_max_single_datum(data: list, category: str) -> float:
    """Return the largest value of a column and print one row that holds it."""
    max_datum = max(data, key = lambda datum: datum[category])
    max_value = max_datum[category]
    print('The maximum value for the {category} is {max_value} for example in the datum {max_datum}'.format(category = category, max_value = max_value, max_datum = max_datum))
    return max_value

def find_max_list_datum(data: list, category: str) -> float:
    """Return the largest value of a column and print every row that holds it."""
    max_value = None  # None instead of 0, so that all-negative columns are handled too
    max_datum = []
    for datum in data:
        value = datum[category]
        if max_value is None or value > max_value:
            max_datum = [datum]
            max_value = value
        elif value == max_value:
            max_datum.append(datum)
    print('The maximum value for the {category} is {max_value} in the datum {max_datum}'.format(category = category, max_value = max_value, max_datum = max_datum))
    return max_value

# find_max_list_datum(data = insurance_data, category = 'age')

def find_min_single(data: list, category: str) -> float:
    """Return the smallest value of a column and print one row that holds it."""
    min_datum = min(data, key = lambda datum: datum[category])
    min_value = min_datum[category]
    print('The minimum value for the {category} is {min_value} for example in the datum {min_datum}'.format(category = category, min_value = min_value, min_datum = min_datum))
    return min_value

print(find_min_single(data = insurance_data, category = 'bmi'))
print(find_max_single_datum(data = insurance_data, category = 'bmi'))




The minimum value for the bmi is 15.96 for example in the datum {'age': 18.0, 'sex': 0, 'bmi': 15.96, 'children': '0', 'smoker': 0, 'region': 'northeast', 'charges': 1694.7964}
15.96
The maximum value for the bmi is 53.13 for example in the datum {'age': 18.0, 'sex': 0, 'bmi': 53.13, 'children': '0', 'smoker': 0, 'region': 'southeast', 'charges': 1163.4627}
53.13
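
The minimum and maximum searches can also share one helper that returns the extreme value together with every row holding it. This is a sketch with a hypothetical function name (`rows_with_extreme`) and made-up sample rows.

```python
def rows_with_extreme(rows: list, column: str, pick=max) -> tuple:
    """Return the extreme value of a column (pick=max or pick=min) and all rows holding it."""
    extreme = pick(row[column] for row in rows)
    return extreme, [row for row in rows if row[column] == extreme]

# Hypothetical sample rows; two of them share the highest age.
sample_rows = [
    {'age': 64.0, 'bmi': 24.7},
    {'age': 18.0, 'bmi': 33.8},
    {'age': 64.0, 'bmi': 31.3},
]

oldest_age, oldest_rows = rows_with_extreme(sample_rows, 'age', pick=max)
lowest_bmi, lowest_rows = rows_with_extreme(sample_rows, 'bmi', pick=min)
print(oldest_age, len(oldest_rows))
print(lowest_bmi, lowest_rows)
```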


* Distinct values
* Count of distinct values

In [184]:
def distinct_values_sorted(data: list, category: str) -> list:
    """Return the sorted distinct values of a column and print how many there are."""
    distinct_values = set()
    for datum in data:
        distinct_values.add(datum[category])
    count_distinct = len(distinct_values)
    # note: string values are sorted lexicographically, not numerically
    distinct_values = sorted(distinct_values)
    print('There are {count_distinct} distinct values in {category} category (sorted): {distinct_values}'.format(count_distinct=count_distinct,category=category,distinct_values=distinct_values))
    return distinct_values

print(distinct_values_sorted(insurance_data, 'children'))


There are 6 distinct values in children category (sorted): ['0', '1', '2', '3', '4', '5']
['0', '1', '2', '3', '4', '5']
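
Since `children` is still stored as strings, the sort above is lexicographic. That makes no difference for single-digit values, but it would for larger ones. The sketch below, using made-up sample values, converts to `int` first and uses `collections.Counter` to obtain the frequency of each distinct value as well.

```python
from collections import Counter

# Hypothetical sample values, including a two-digit one to expose the sorting issue.
children = ['0', '1', '3', '0', '0', '1', '10']

as_strings = sorted(set(children))             # lexicographic order
as_numbers = sorted({int(v) for v in children})  # numeric order

# Counter maps each distinct value to the number of rows holding it.
counts = Counter(int(v) for v in children)

print(as_strings)
print(as_numbers)
print(sorted(counts.items()))
```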


* Average
* Median
* Quartiles

In [206]:
#TODO decorator for the category selection

def average(data: list, category: str) -> float:
    """Return the arithmetic mean of a column."""
    category_values = [datum[category] for datum in data]
    mean_value = sum(category_values) / len(category_values)
    print('Category {category} has average of {average:.1f}'.format(category=category, average=mean_value))
    return mean_value

average(insurance_data, 'bmi')

def percentile(data: list, category: str, fraction: float) -> float:
    """Approximate a quantile by picking the nearest sorted value; fraction runs from 0 to 1."""
    sorted_values = sorted(datum[category] for datum in data)
    # round() selects the nearest index, so no interpolation between neighbours takes place
    percentile_index = round((len(sorted_values) - 1) * fraction)
    percentile_value = sorted_values[percentile_index]
    print('The {percentile} percentile of the {category} category is {percentile_value:.1f}'.format(percentile=fraction, category=category, percentile_value=percentile_value))

    return percentile_value

percentile(insurance_data, 'bmi', 0.25)

def median(data: list, category: str) -> float:
    return percentile(data, category, fraction=0.5)

median(insurance_data, 'bmi')

def first_quartile(data: list, category: str) -> float:
    return percentile(data, category, fraction=0.25)

def third_quartile(data: list, category: str) -> float:
    return percentile(data, category, fraction=0.75)

third_quartile(insurance_data, 'bmi')

def minimum(data: list, category: str) -> float:
    return percentile(data, category, fraction=0)

def maximum(data: list, category: str) -> float:
    return percentile(data, category, fraction=1)

minimum(insurance_data, 'bmi')
maximum(insurance_data, 'bmi')
find_max_single_datum(insurance_data, 'bmi')


Category bmi has average of 30.7
The 0.25 percentile of the bmi category is 26.3
The 0.5 percentile of the bmi category is 30.4
The 0.75 percentile of the bmi category is 34.7
The 0 percentile of the bmi category is 16.0
The 1 percentile of the bmi category is 53.1
The maximum value for the bmi is 53.13 for example in the datum {'age': 18.0, 'sex': 0, 'bmi': 53.13, 'children': '0', 'smoker': 0, 'region': 'southeast', 'charges': 1163.4627}


53.13
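
The quantiles above are approximated by rounding the index, so with an even number of values the median is taken from one of the two middle values rather than from their mean. A sketch of a linearly interpolated alternative follows; the function name `quantile` is hypothetical, and the sample values are the BMI values of the first nine rows shown earlier. The result is cross-checked against `statistics.quantiles` with `method='inclusive'`.

```python
import math
import statistics

def quantile(values: list, fraction: float) -> float:
    """Quantile with linear interpolation between neighbouring sorted values (fraction in 0..1)."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * fraction
    lower = int(position)                        # index just below the target position
    upper = min(lower + 1, len(ordered) - 1)     # index just above, clamped at the end
    weight = position - lower                    # share of the distance to the upper value
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])

# BMI values of the first nine rows printed above.
bmi_sample = [27.9, 33.77, 33.0, 22.705, 28.88, 25.74, 33.44, 27.74, 29.83]

quartiles = [quantile(bmi_sample, f) for f in (0.25, 0.5, 0.75)]
reference = statistics.quantiles(bmi_sample, n=4, method='inclusive')

print(quartiles)
print(reference)
print(quantile([1, 2, 3, 4], 0.5))
```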

In [186]:
import pandas as pd
import numpy as np
import plotly as pl

In [187]:
#TODO correlation coefficients, ANOVA ... model

#TODO outlier testing, robust estimators ... box plot!
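
As a possible starting point for these TODOs, the sketch below computes a Pearson correlation coefficient and the Tukey fences that a box plot uses to flag outliers, with the standard library only. The function names, the factor `k=1.5` and the sample numbers are assumptions for illustration; nothing here is a result from the dataset. For a simple linear regression with one predictor, R^2 equals the square of the Pearson coefficient.

```python
import math
import statistics

def pearson_r(xs: list, ys: list) -> float:
    """Pearson correlation coefficient of two equally long numeric lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

def tukey_fences(values: list, k: float = 1.5) -> tuple:
    """Lower and upper fence: values outside are the points a box plot marks as outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample numbers.
ages = [1, 2, 3]
charges = [2, 4, 6]
r = pearson_r(ages, charges)
r_squared = r ** 2

values = [1, 2, 3, 4, 5, 100]
low, high = tukey_fences(values)
outliers = [v for v in values if v < low or v > high]

print(r, r_squared)
print(low, high, outliers)
```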