In [27]:
import csv
import math
import numpy as np



Calculation of Mean, Median, Standard Deviation and Correlation

In [8]:
def calculate_mean(lst):
    # Arithmetic mean, rounded to the nearest integer for display.
    return round(sum(lst) / len(lst))

def calculate_median(lst):
    # Sort a copy so the caller's list keeps its original row order.
    ordered = sorted(lst)
    n = len(ordered)
    if n % 2 == 0:
        return (ordered[n//2-1] + ordered[n//2]) / 2
    else:
        return ordered[n//2]

def calculate_std(lst,mean):
    # Population standard deviation (divides by n), rounded for display.
    variance = sum((x - mean) ** 2 for x in lst) / len(lst)
    return round(math.sqrt(variance))

def calcule_correlation(x,y):
    # Pearson correlation coefficient, computed from running sums.
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(i*j for i,j in zip(x,y))
    sum_x2 = sum(i**2 for i in x)
    sum_y2 = sum(j**2 for j in y)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = ((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2)) ** 0.5
    return numerator / denominator

Descriptive Statistics

In [3]:
ages = []
bmis = []
charges = []
with open("insurance.csv", newline="") as insurance_file:
    insurance_reader = csv.DictReader(insurance_file)
    for row in insurance_reader:
        ages.append(float(row["age"]))
        bmis.append(float(row["bmi"]))
        charges.append(float(row["charges"]))

mean_age = calculate_mean(ages)
median_age = calculate_median(ages)
std_age = calculate_std(ages,mean_age)

mean_bmi = calculate_mean(bmis)
median_bmi = calculate_median(bmis)
std_bmi = calculate_std(bmis,mean_bmi)

mean_charges = calculate_mean(charges)
median_charges = calculate_median(charges)
std_charges = calculate_std(charges,mean_charges)

print(f"Age - Mean: {mean_age}, Median: {median_age}, Standard Deviation: {std_age}")
print(f"BMI - Mean: {mean_bmi}, Median: {median_bmi}, Standard Deviation: {std_bmi}")
print(f"Charges - Mean: {mean_charges}, Median: {median_charges}, Standard Deviation: {std_charges}")

Age - Mean: 39, Median: 39.0, Standard Deviation: 14
BMI - Mean: 31, Median: 30.4, Standard Deviation: 6
Charges - Mean: 13270, Median: 9382.033, Standard Deviation: 12105


Age and Insurance Costs: A mean and median age of 39 indicate that the typical policyholder in this data set is middle-aged. Because health risks tend to increase with age, this age group can be an important one for insurance companies.

BMI and Insurance Costs: The mean BMI of 31 sits just above the commonly used obesity threshold of 30, and charges have a large standard deviation (12105, close to the mean of 13270). Together, these may indicate that individuals with a higher BMI incur higher insurance costs because of health problems, although the summary statistics alone cannot confirm this.

Risk Groups: Given the high average BMI and the wide spread in charges, costs for younger, normal-weight individuals may be low, while costs for older and obese individuals may be considerably higher.

Differences in Insurance Costs: The wide distribution of charges indicates large differences between individuals. The mean (13270) is well above the median (9382.033), which suggests a right-skewed distribution in which a minority of high-cost individuals pulls the mean up. The size of these differences usually depends on variables such as smoking, age and BMI.

Insurance Premium and Risk Management: For insurance companies, this data set may make it possible to identify high-cost risk groups and to price their premiums accordingly.
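
A mean that sits well above the median is the usual signature of a right-skewed variable. The following is a minimal sketch of that check using Python's standard statistics module. The sample values are invented for illustration and are not taken from insurance.csv.

```python
import statistics

# Hypothetical charges: mostly modest values plus one very large one.
sample_charges = [2000.0, 3000.0, 4000.0, 5000.0, 6000.0, 40000.0]

mean_value = statistics.mean(sample_charges)
median_value = statistics.median(sample_charges)
# pstdev divides by n, like the population formula in calculate_std.
std_value = statistics.pstdev(sample_charges)

print(f"Mean: {mean_value}, Median: {median_value}, Standard Deviation: {round(std_value)}")

# A single large value pulls the mean up but leaves the median almost unchanged.
is_right_skewed = mean_value > median_value
print("Mean above median:", is_right_skewed)
```

Unlike calculate_mean, this sketch does not round the mean before it is used, which avoids carrying a rounding error into later calculations.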


Insurance - BMI Correlation

In [4]:
bmis = []
charges = []

with open("insurance.csv",newline="") as insurance_file:
    insurance_reader = csv.DictReader(insurance_file)
    for row in insurance_reader:
        bmis.append(float(row["bmi"]))
        charges.append(float(row["charges"]))

correlation_coefficient = calcule_correlation(bmis,charges)
print(f"Correlation Coefficient Between BMI and Charges {correlation_coefficient}")

Correlation Coefficient Between BMI and Charges 0.1983409688336287


A correlation coefficient of 0.1983 indicates a positive but weak relationship between BMI and insurance costs.

Weak Positive Correlation: With a coefficient of around 0.2, insurance costs tend to increase with BMI, but the relationship is not strong. Squaring the coefficient gives roughly 0.04, so a linear relationship with BMI accounts for only about 4% of the variation in charges. This suggests that other factors explain most of the differences in insurance costs.
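
One way to read a coefficient of this size is through its square, which for a simple linear relationship gives the share of variance in one variable associated with the other. The sketch below repeats the sums-based formula under the assumed name pearson_r, cross-checks it against numpy.corrcoef on invented values, and then squares the coefficient reported above.

```python
import numpy as np

def pearson_r(x, y):
    # Pearson correlation from running sums (same formula as calcule_correlation).
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(i * j for i, j in zip(x, y))
    sum_x2 = sum(i ** 2 for i in x)
    sum_y2 = sum(j ** 2 for j in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = ((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)) ** 0.5
    return numerator / denominator

# Hypothetical BMI and charges values, invented for illustration only.
x = [22.0, 25.0, 28.0, 31.0, 34.0, 37.0]
y = [5000.0, 9000.0, 4000.0, 12000.0, 7000.0, 11000.0]

r = pearson_r(x, y)
r_numpy = float(np.corrcoef(x, y)[0, 1])
print(f"r = {r:.4f}, numpy r = {r_numpy:.4f}")

# Square of the BMI-charges coefficient printed by the notebook.
reported_r_squared = 0.1983409688336287 ** 2
print(f"Reported coefficient squared: {reported_r_squared:.4f}")
```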

Insurance - Age Correlation

In [5]:
ages = []
charges = []

with open("insurance.csv",newline="") as insurance_file:
    insurance_reader = csv.DictReader(insurance_file)
    for row in insurance_reader:
        ages.append(float(row["age"]))
        charges.append(float(row["charges"]))

correlation_coefficient = calcule_correlation(ages,charges)
print(f"Correlation Coefficient Between Ages and Charges {correlation_coefficient}")

Correlation Coefficient Between Ages and Charges 0.2990081933306475


Positive Correlation: The coefficient is positive, which means that insurance costs tend to increase with age. Older individuals generally have higher insurance costs.

Moderate Correlation: A coefficient of around 0.3 is usually read as a weak-to-moderate relationship. Age is a relevant factor in insurance costs, but it is not a sufficient determinant on its own.

Health Risk and Cost: The likelihood of health problems generally increases with age. This can lead to more medical care and therefore higher insurance costs.

Insurance - Number of Children Correlation

In [6]:
num_children = []
charges = []

with open("insurance.csv",newline="") as insurance_file:
    insurance_reader = csv.DictReader(insurance_file)
    for row in insurance_reader:
        num_children.append(float(row["children"]))
        charges.append(float(row["charges"]))

correlation_coefficient = calcule_correlation(num_children,charges)
print(f"Correlation Coefficient Between Num Children and Charges {correlation_coefficient}")


The value previously recorded for this cell, 0.2990081933306475, is identical to the age coefficient because ages was passed to calcule_correlation in place of num_children. It therefore says nothing about the number of children, and the cell needs to be rerun with the call above before any conclusion is drawn.

Positive Correlation: If the recomputed coefficient is positive, it means that insurance costs tend to increase as the number of children increases.

Strength of the Relationship: The reading of "moderate, around 0.3" applies to age, not to the number of children. How much the number of children matters for insurance costs can only be judged from the recomputed coefficient.

Impact of Family Structure: Having more children may be associated with more healthcare expenses and therefore higher insurance costs, since a family's need for health care may increase. Whether this data set supports that expectation depends on the recomputed value.
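
The three correlation cells repeat the same read-and-correlate steps with different column names. One possible way to keep each label tied to its data is to read the columns once and loop over the names. This is a sketch only: read_columns is a hypothetical helper, the in-memory CSV stands in for insurance.csv, and its values are invented.

```python
import csv
import io

import numpy as np

# Hypothetical rows with the same column names as insurance.csv.
SAMPLE_CSV = """age,bmi,children,charges
20,24.0,0,3000.0
30,28.5,1,5200.0
40,31.0,2,7900.0
50,26.5,3,9100.0
60,33.5,1,14000.0
"""

def read_columns(file_obj, names):
    # Read the file once and return {column name: list of floats}.
    columns = {name: [] for name in names}
    for row in csv.DictReader(file_obj):
        for name in names:
            columns[name].append(float(row[name]))
    return columns

columns = read_columns(io.StringIO(SAMPLE_CSV), ["age", "bmi", "children", "charges"])

correlations = {}
for name in ["age", "bmi", "children"]:
    # The label and the data both come from the loop variable,
    # so a column cannot be paired with the wrong label.
    correlations[name] = float(np.corrcoef(columns[name], columns["charges"])[0, 1])
    print(f"Correlation Coefficient Between {name} and charges {correlations[name]}")
```

With the real file, io.StringIO(SAMPLE_CSV) would be replaced by the handle returned from open("insurance.csv", newline="").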


Women-Men Insurance Relationship

In [26]:
males = []
females = []

with open("insurance.csv",newline="") as insurance_file:
    insurance_reader = csv.DictReader(insurance_file)
    for row in insurance_reader:
        if row["sex"] == "male":
            males.append(float(row["charges"]))
        else:
            females.append(float(row["charges"]))
    
mean_male = calculate_mean(males)
mean_female = calculate_mean(females)

print(f"Average Insurance Cost for males {mean_male}")
print(f"Average Insurance Cost for females {mean_female}")

Average Insurance Cost for males 13957
Average Insurance Cost for females 12570


Men's insurance costs are, on average, higher than women's (13957 versus 12570, a difference of 1387). This may indicate that men have higher healthcare expenses or a higher risk profile than women, but the gap may also reflect other differences between the two groups, such as smoking status.
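
A difference between two group means can reflect other variables that differ between the groups. The sketch below shows how mean charges could be broken down by sex and smoker status together. The rows are hypothetical and the values invented. Only the column names sex, smoker and charges come from the data set.

```python
from collections import defaultdict

# Hypothetical rows shaped like the dictionaries produced by csv.DictReader,
# with charges already converted to float.
rows = [
    {"sex": "male", "smoker": "yes", "charges": 30000.0},
    {"sex": "male", "smoker": "yes", "charges": 34000.0},
    {"sex": "male", "smoker": "no", "charges": 8000.0},
    {"sex": "female", "smoker": "yes", "charges": 32000.0},
    {"sex": "female", "smoker": "no", "charges": 7000.0},
    {"sex": "female", "smoker": "no", "charges": 9000.0},
]

# Collect charges under a (sex, smoker) key.
grouped = defaultdict(list)
for row in rows:
    grouped[(row["sex"], row["smoker"])].append(row["charges"])

stratified_means = {key: sum(values) / len(values) for key, values in grouped.items()}

for (sex, smoker), mean_charge in sorted(stratified_means.items()):
    print(f"sex={sex}, smoker={smoker}: mean charges {mean_charge}")
```

If the gap between men and women shrinks within each smoker group, part of the overall difference is likely due to the groups' different smoking rates.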

Smoker-non-smoker insurance relationship

In [8]:
smoker_charges = []
non_smoker_charges = []

with open("insurance.csv", newline="") as insurance_file:
    insurance_reader = csv.DictReader(insurance_file)
    for row in insurance_reader:
        if row["smoker"] == "yes":
            smoker_charges.append(float(row["charges"]))
        else:
            non_smoker_charges.append(float(row["charges"]))

mean_smoker_charges = calculate_mean(smoker_charges)
mean_non_smoker_charges = calculate_mean(non_smoker_charges)

print(f"Average Insurance Cost for Smokers {mean_smoker_charges}")
print(f"Average Insurance Cost for Non-Smokers {mean_non_smoker_charges}")

difference = mean_smoker_charges - mean_non_smoker_charges
print(f"Insurance Difference Between Smokers and Non-Smokers {difference}")




Average Insurance Cost for Smokers 32050
Average Insurance Cost for Non-Smokers 8434
Insurance Difference Between Smokers and Non-Smokers 23616


Health and Financial Impact: On average, smokers are charged 32050 compared with 8434 for non-smokers, a difference of 23616, or roughly 3.8 times as much. These data suggest that the effects of smoking are not limited to health; smoking is also associated with much higher financial liabilities for individuals.

Policy Development by Insurance Companies: Insurance companies can manage their health risks by applying higher premiums to smokers based on such data. They can also encourage healthy lifestyles by offering non-smokers lower premiums.

Awareness for Public Health: Such analyses can raise awareness of the financial consequences of smoking as well as its effects on individual health. Lower insurance costs for non-smokers can help promote healthy lifestyles.
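
The absolute difference is easier to interpret alongside a ratio. The short sketch below derives both from the two means printed above. The variable percent_increase is introduced here for illustration.

```python
# Means reported by the notebook output above.
mean_smoker_charges = 32050
mean_non_smoker_charges = 8434

difference = mean_smoker_charges - mean_non_smoker_charges
ratio = mean_smoker_charges / mean_non_smoker_charges
# Increase relative to the non-smoker mean, in percent.
percent_increase = difference / mean_non_smoker_charges * 100

print(f"Difference: {difference}")
print(f"Ratio: {ratio:.2f}")
print(f"Percent increase over non-smokers: {percent_increase:.0f}%")
```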

Sex and Smoker Status Relationship

In [6]:
num_males_smokers = 0
num_males_non_smokers = 0
num_females_smokers = 0
num_females_non_smokers = 0

with open("insurance.csv",newline="") as insurance_file:
    insurance_reader = csv.DictReader(insurance_file)
    for row in insurance_reader:
        sex = row["sex"]
        smoker = row["smoker"]

        if sex == "male" and smoker == "yes":
            num_males_smokers += 1
        elif sex == "male" and smoker == "no":
            num_males_non_smokers += 1
        elif sex == "female" and smoker == "yes":
            num_females_smokers += 1
        elif sex == "female" and smoker == "no":
            num_females_non_smokers += 1

print("Number of men who smoke ",num_males_smokers)
print("Number of men who do not smoke ",num_males_non_smokers)
print("Number of women who smoke ",num_females_smokers)
print("Number of women who do not smoke ",num_females_non_smokers)


Number of men who smoke  159
Number of men who do not smoke  517
Number of women who smoke  115
Number of women who do not smoke  547


These results show that the smoking rate is higher among men than among women: about 23.5% of men smoke (159 of 676), compared with about 17.4% of women (115 of 662). Social and cultural factors may influence smoking habits across genders, although this data set does not measure them. In both groups non-smokers form a clear majority. Because the data are a single snapshot, they cannot show whether smoking is decreasing over time.
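
The percentages follow directly from the four counts printed above. A minimal sketch of the arithmetic, where smoking_rate is a hypothetical helper introduced for illustration:

```python
# Counts reported by the notebook output above.
counts = {
    ("male", "yes"): 159,
    ("male", "no"): 517,
    ("female", "yes"): 115,
    ("female", "no"): 547,
}

def smoking_rate(sex):
    # Share of smokers within one sex: smokers / (smokers + non-smokers).
    smokers = counts[(sex, "yes")]
    total = smokers + counts[(sex, "no")]
    return smokers / total

male_rate = smoking_rate("male")
female_rate = smoking_rate("female")

print(f"Men who smoke: {male_rate:.1%}")
print(f"Women who smoke: {female_rate:.1%}")
```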

Regional Insurance Differences

In [9]:
southwest_region = []
northwest_region = []
southeast_region = []
northeast_region = []

with open("insurance.csv",newline="") as insurance_file:
    insurance_reader = csv.DictReader(insurance_file)
    for row in insurance_reader:
        if row["region"] == "southwest":
            southwest_region.append(float(row["charges"]))
        elif row["region"] == "northwest":
            northwest_region.append(float(row["charges"]))
        elif row["region"] == "southeast":
            southeast_region.append(float(row["charges"]))
        else:
            northeast_region.append(float(row["charges"]))

mean_sw = calculate_mean(southwest_region)
mean_nw = calculate_mean(northwest_region)
mean_se = calculate_mean(southeast_region)
mean_ne = calculate_mean(northeast_region)

print(f"Average Insurance Cost for southwest region {mean_sw}")
print(f"Average Insurance Cost for northwest region {mean_nw}")
print(f"Average Insurance Cost for southeast region {mean_se}")
print(f"Average Insurance Cost for northeast region {mean_ne}")

Average Insurance Cost for southwest region 12347
Average Insurance Cost for northwest region 12418
Average Insurance Cost for southeast region 14735
Average Insurance Cost for northeast region 13406


Southwest Region (Average Insurance Cost: 12,347): The average cost of insurance in this region is the lowest of the four regions. This may indicate that the structure of healthcare services, lifestyles or insurance policies in the region is more favorable.

Northwest Region (Average Insurance Cost: 12,418): Its average is very close to that of the Southwest, with a difference of only 71. This similarity suggests that healthcare services and demographic structure in the two regions may be similar.

Southeast Region (Average Insurance Cost: 14,735): Insurance costs in this region are the highest of the four regions. This may be associated with difficulties accessing healthcare, higher health risks, or a greater prevalence of health problems in the region.

Northeast Region (Average Insurance Cost: 13,406): The average insurance cost in this region is below that of the Southeast but higher than in both western regions. Differences in population density and in the health services available may be among the factors affecting costs, although this data set contains no information on either.

Finding outliers in the charges column using IQR and Z-score methods

In [11]:
charges = []

with open("insurance.csv",newline="") as insurance_file:
    insurance_reader = csv.DictReader(insurance_file)
    for row in insurance_reader:
        charges.append(float(row["charges"]))

def find_outliers_iqr(data):
    Q1 = np.percentile(data,25)
    Q3 = np.percentile(data,75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = [i for i in data if i < lower_bound or i > upper_bound]
    return outliers

def find_outliers_zscore(data):
    mean = np.mean(data)
    std_dev = np.std(data)
    outliers = [i for i in data if (i-mean) / std_dev > 3 or (i-mean) / std_dev < -3]
    return outliers

outliers_iqr = find_outliers_iqr(charges)
outliers_zscore = find_outliers_zscore(charges)

print("Outliers with the IQR method: ",outliers_iqr)
print("Outliers with Z-score method: ",outliers_zscore)

Outliers with the IQR method:  [39611.7577, 36837.467, 37701.8768, 38711.0, 35585.576, 51194.55914, 39774.2763, 48173.361, 38709.176, 37742.5757, 47496.49445, 37165.1638, 39836.519, 43578.9394, 47291.055, 47055.5321, 39556.4945, 40720.55105, 36950.2567, 36149.4835, 48824.45, 43753.33705, 37133.8982, 34779.615, 38511.6283, 35160.13457, 47305.305, 44260.7499, 41097.16175, 43921.1837, 36219.40545, 46151.1245, 42856.838, 48549.17835, 47896.79135, 42112.2356, 38746.3551, 42124.5153, 34838.873, 35491.64, 42760.5022, 47928.03, 48517.56315, 41919.097, 36085.219, 38126.2465, 42303.69215, 46889.2612, 46599.1084, 39125.33225, 37079.372, 35147.52848, 48885.13561, 36197.699, 38245.59327, 48675.5177, 63770.42801, 45863.205, 39983.42595, 45702.02235, 58571.07448, 43943.8761, 39241.442, 42969.8527, 40182.246, 34617.84065, 42983.4585, 42560.4304, 40003.33225, 45710.20785, 46200.9851, 46130.5265, 40103.89, 34806.4677, 40273.6455, 44400.4064, 40932.4295, 40419.0191, 36189.1017, 44585.45587, 43254.41795, 

Both methods can detect outliers in insurance charges, but they produce different results. The IQR method flags a larger set of values as outliers, while the Z-score method detects only the most extreme points. This is expected for a right-skewed variable such as charges: the 1.5 × IQR fence is based on the middle half of the data, whereas the mean and standard deviation used by the Z-score are themselves pulled up by the extreme values. This difference should be taken into account when analyzing and interpreting the data set.

Additionally, the causes of the outliers should be investigated, and their impact on risk assessment in insurance policies should be evaluated. This can help set more equitable and effective insurance premiums.
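
The difference between the two methods can be seen on a small example. The sketch below applies the same 1.5 × IQR and |z| > 3 rules as the functions above to an invented sample. The data values are hypothetical and chosen so that one value is moderately high and one is extreme.

```python
import numpy as np

def find_outliers_iqr(data):
    # Flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return [value for value in data if value < lower_bound or value > upper_bound]

def find_outliers_zscore(data, threshold=3):
    # Flag values more than `threshold` population standard deviations from the mean.
    mean = np.mean(data)
    std_dev = np.std(data)
    return [value for value in data if abs((value - mean) / std_dev) > threshold]

# Hypothetical sample: the integers 1..20 plus a high value and an extreme value.
sample = list(range(1, 21)) + [40, 100]

outliers_iqr = find_outliers_iqr(sample)
outliers_zscore = find_outliers_zscore(sample)

print("Outliers with the IQR method: ", outliers_iqr)
print("Outliers with Z-score method: ", outliers_zscore)
```

Here the extreme value inflates the standard deviation enough that the moderately high value stays within three standard deviations of the mean, so only the IQR rule flags it.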