# U.S. Med Insurance Costs Project

Author: ***Oleg Astafyev***

In this project, a **CSV** file with medical insurance costs will be investigated using Python fundamentals. The goal with this project will be to analyze various attributes within **insurance.csv** to learn more about the patient information in the file and gain insight into potential use cases for the dataset.

Q1: What is the female/male ratio in the whole dataset? <br>

Q2: What is the BMI distribution within the whole dataset? <br>

Q3: What is the smoker/non-smoker ratio in the whole dataset? <br>

Q4: What is the childless/patients with children ratio in the whole dataset? <br>

Q5: Which age group is the largest? <br>

Q6: Which age group is the smallest? <br>

Q7: What is the BMI distribution within each age group? <br>

Q8: What is the female/male ratio in each age group? <br>

Q9: What is the smoker/non-smoker ratio in each age group? <br>

Q10: What is the childless/patients with children ratio in each age group? <br>

In [87]:
# import csv library
import csv

### In order to manipulate data in a more efficient manner it is first plausible to store columns in respective lists

# Initializing list variables for each column
age=[]
sex=[]
bmi=[]
children=[]
smoker=[]
region=[]
charges=[]

# Open the original file, parse and append column values to respective list variables

with open("insurance.csv", newline = "") as insurance_file:
    reader = csv.DictReader(insurance_file)
    
    for column in reader:
        age.append(column["age"])
        sex.append(column["sex"])
        bmi.append(column["bmi"])
        children.append(column["children"])
        smoker.append(column["smoker"])
        region.append(column["region"])
        charges.append(column["charges"])
        
# Create a dictionary where patient id would be the key and the value would be a dictionary 
# with keys "age", "sex" etc. and their respective values

patients = {}

for i in range(len(age)):
    
    patients[i] = {"Age":age[i],
                   "Sex":sex[i],
                   "BMI":bmi[i],
                   "Children":children[i],
                   "Smoker":smoker[i],
                   "Region":region[i],
                   "Charges":charges[i]}


# -/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/

### Before starting the analysis it is important to narrow the scope, i.e. divide patients
### into some meaningful groups, e.g. by age or BMI. Since we are dealing with the US
### healthcare data, I rely on the age group definition provided on the CMS.gov website,
### BMI ranges provided on the CDC.gov website

### Since we know that the maximum age in our dataset is 64, I will only use 3 out of 5 
### suggested age groups: 0-18, 19-44, 45-64

# Age groups:

# Group1: 0-18
ageGroup1=[]

# Group2: 19-44
ageGroup2=[]

# Group3: 45-64
ageGroup3=[]

### The variables above are meant to be global, since the intention is to later have 
### a possibility to manipulate patient data within a specific group

# -/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/

# Function to divide patients within a dataset by age groups and determine which one is the largest/smallest
# Takes a dataset as a single argument, sorts patients by age & prints out which age group is the largest/smallest

def divideByAge(patients):

    # Saving patient id into a list of the respective age group

    for patient in patients:
    
        if int(patients[patient]["Age"]) < 19:
            ageGroup1.append(patient)
        
        elif (int(patients[patient]["Age"]) >= 19) and (int(patients[patient]["Age"]) < 45):
            ageGroup2.append(patient)
        
        elif (int(patients[patient]["Age"]) >= 45) and (int(patients[patient]["Age"]) < 65):
            ageGroup3.append(patient)

    # Resulting "ageGroups" list containing three age group lists with patient ids
    ageGroups = ageGroup1, ageGroup2, ageGroup3
 
    # Now we can find out which age group is the largest and which one is the smallest
    
    # Storing patient count within each age group to a list
    patientNums = [len(ageGroups[0]), len(ageGroups[1]), len(ageGroups[2])]
    
    # Determining the index of the largest age group
    maxIndex = patientNums.index(max(patientNums))
    
    if maxIndex == 0:
        print("The 0-18 group is the largest in the provided dataset" + "\n")
        
    elif maxIndex == 1:
        print("The 19-44 group is the largest in the provided dataset" + "\n")
    
    else:
        print("The 45-64 group is the largest in the provided dataset" + "\n")
    
    # Determining the index of the smallest age group
    minIndex = patientNums.index(min(patientNums))
    
    if minIndex == 0:
        print("The 0-18 group is the smallest in the provided dataset" + "\n")
        
    elif minIndex == 1:
        print("The 19-44 group is the smallest in the provided dataset" + "\n")
    
    else:
        print("The 45-64 group is the smallest in the provided dataset" + "\n")
        
# -/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/

# Function to divide patients within a dataset by BMI and determine the BMI distribution
# Takes a dataset as a single argument, sorts patients by BMI & prints out the BMI distribution within a given dataset

def bmiInfo(group):
    
    # BMI groups:
    
    # < 18.5        underweight
    underweight = []
    # 18.5 - 24.9   healthy
    healthy = []
    # 25.0 -29.9    overweight
    overweight = []
    # > 30.0        obese
    obese = []
    
    # Sorting patients by BMI
    for patient in group:
    
        if float(patients[patient]["BMI"]) < 18.5:
            underweight.append(patient)
    
        elif (float(patients[patient]["BMI"]) > 18.5) and (float(patients[patient]["BMI"]) < 25.0):
            healthy.append(patient)
    
        elif (float(patients[patient]["BMI"]) > 25.0) and (float(patients[patient]["BMI"]) < 30):
            overweight.append(patient)
    
        elif (float(patients[patient]["BMI"]) > 30.0):
            obese.append(patient)
    
    # Preparing BMI group member counts for printing
    uw = str(len(underweight))
    h = str(len(healthy))
    ow = str(len(overweight))
    ob = str(len(obese))
    
    # Printing BMI distribution
    print("There are " + uw + " underweight, " + h + " healthy, " + ow + " overweight, " + ob + " obese" + " patients in a given dataset." + "\n")
    
    print("Underweight patients comprise " + str(round((len(underweight)*100/len(group)))) + "%")
    
    print("Healthy patients comprise " + str(round((len(healthy)*100/len(group)))) + "%")
    
    print("Overweight patients comprise " + str(round((len(overweight)*100/len(group)))) + "%")
    
    print("Obese patients comprise " + str(round((len(obese)*100/len(group)))) + "%")
    
            
# -/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/

# Function to determine the number of females/males and the female/male ratio in a given dataset 
# Takes a dataset as a single argument and prints out the female/male ratio

def sexInfo(group):
    
    # Initializing list variables for both sexes
    males = []
    females = []

    # Saving patient id into a list of the respective sex

    for patient in group:
    
        if patients[patient]["Sex"] == "female":
            females.append(patient)
        else:
            males.append(patient)
    
    # Printing sex ratio

    print("The female/male ratio is: " + str(len(females)) + "/" + str(len(males)))
    
    print("Females comprise " + str(round((len(females)*100/len(group)))) + "% of patients, whereas males " 
      + str(round((len(males)*100/len(group)))) + "%" + "\n")

# -/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/

# Function to determine the number of smokers/non-smokers in the dataset and the smoker/non-smoker ratio
# Takes a dataset as a single argument and prints out the smoker/non-smoker ratio

def smokerInfo(group):

    # Initializing list variables for smokers and non-smokers
    smokers = []
    nonsmokers = []

    for patient in group:
    
        if patients[patient]["Smoker"] == "yes":
            smokers.append(patient)
        else:
            nonsmokers.append(patient)
        
        # Printing smoker/non-smoker ratio
        
    print("The smoker/non-smoker ratio is: " + str(len(smokers)) + "/" + str(len(nonsmokers)))
    
    print("Smokers comprise " + str(round((len(smokers)*100/len(group)))) + "% of patients, whereas non-smokers " 
          + str(round((len(nonsmokers)*100/len(group)))) + "%" + "\n")
    
# -/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/

def childrenInfo(group):
    
    # Initializing list variables for 2 general categories: people with children and people without
    childless = []
    haveChildren = []
    
    # Initializing list variables for groups per children number.
    # Since it is known that the highest number of children in the whole dataset is 5, there should be 5 groups:
    childNum1 = []
    childNum2 = []
    childNum3 = []
    childNum4 = []
    childNum5 = []
    
    # Dividing patients into 2 general categories first
    
    for patient in group:
        
        if patients[patient]["Children"] == "0":
            childless.append(patient)
        else:
            haveChildren.append(patient)
            
    # Sorting people with children per number of children
    
    for patient in haveChildren:
        
        if patients[patient]["Children"] == "1":
            childNum1.append(patient)
            
        elif patients[patient]["Children"] == "2":
            childNum2.append(patient)
            
        elif patients[patient]["Children"] == "3":
            childNum3.append(patient)
            
        elif patients[patient]["Children"] == "4":
            childNum4.append(patient)
            
        elif patients[patient]["Children"] == "5":
            childNum5.append(patient)
    
    # Determining the most populated group per number of children
    
    childNums = [len(childNum1), len(childNum2), len(childNum3), len(childNum4), len(childNum5)]
    
    maxIndex = childNums.index(max(childNums))
                               
    if maxIndex == 0:
        print("The majority of patients in the provided dataset have 1 child" + "\n")
        
    elif maxIndex == 1:
        print("The majority of patients in the provided dataset have 2 children" + "\n")
    
    elif maxIndex == 2:
        print("The majority of patients in the provided dataset have 3 children" + "\n")
                               
    elif maxIndex == 3:
        print("The majority of patients in the provided dataset have 4 children" + "\n")
                               
    elif maxIndex == 4:
        print("The majority of patients in the provided dataset have 5 children" + "\n")
    
    # Determining the least populated group per number of children
                               
    minIndex = childNums.index(min(childNums))
    
    if minIndex == 0:
        print("The minority of patients in the provided dataset have 1 child" + "\n")
        
    elif minIndex == 1:
        print("The minority of patients in the provided dataset have 2 children" + "\n")
    
    elif minIndex == 2:
        print("The minority of patients in the provided dataset have 3 children" + "\n")
                               
    elif minIndex == 3:
        print("The minority of patients in the provided dataset have 4 children" + "\n")
                               
    elif minIndex == 4:
        print("The minority of patients in the provided dataset have 5 children" + "\n")
    
    
    # Printing the ratio of childless patients to patients with children
    
    print("The ratio of childless patients/patients with children is: " + str(len(childless)) + "/" + str(len(haveChildren)))
     
    print("Childless patients comprise " + str(round(len(childless)*100/len(group))) + "%, whereas patients with children " + str(round(len(haveChildren)*100/len(group))) + "%" + "\n")
    
    
# -/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/
# -/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/
# -/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/
# -/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/

### Now is finally the time to get all the answers!

# Q1 answer:        

print("Q1:" + "\n")
sexInfo(patients)

#Q2 answer:

print("Q2:" + "\n")
bmiInfo(patients)
print("\n")

#Q3 answer:

print("Q3:" + "\n")
smokerInfo(patients)

#Q4 answer:
print("Q4:" + "\n")
childrenInfo(patients)

#Q5 & Q6 answers:

print("Q5 & Q6:" + "\n")
divideByAge(patients)

#Q7-Q10 answers:

print("Q7-Q10:" + "\n")
print("Age group 0-18:" + "\n")
bmiInfo(ageGroup1)
print("\n")
sexInfo(ageGroup1)
smokerInfo(ageGroup1)
childrenInfo(ageGroup1)

print("/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/" + "\n")
print("Age group 19-44:" + "\n")
bmiInfo(ageGroup2)
print("\n")
sexInfo(ageGroup2)
smokerInfo(ageGroup2)
childrenInfo(ageGroup2)

print("/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/" + "\n")
print("Age group 45-64:" + "\n")
bmiInfo(ageGroup3)
print("\n")
sexInfo(ageGroup3)
smokerInfo(ageGroup3)
childrenInfo(ageGroup3)

Q1:

The female/male ratio is: 662/676
Females comprise 49% of patients, whereas males 51%

Q2:

There are 20 underweight, 224 healthy, 384 overweight, 705 obese patients in a given dataset.

Underweight patients comprise 1%
Healthy patients comprise 17%
Overweight patients comprise 29%
Obese patients comprise 53%


Q3:

The smoker/non-smoker ratio is: 274/1064
Smokers comprise 20% of patients, whereas non-smokers 80%

Q4:

The majority of patients in the provided dataset have 1 child

The minority of patients in the provided dataset have 5 children

The ratio of childless patients/patients with children is: 574/764
Childless patients comprise 43%, whereas patients with children 57%

Q5 & Q6:

The 19-44 group is the largest in the provided dataset

The 0-18 group is the smallest in the provided dataset

Q7-Q10:

Age group 0-18:

There are 2 underweight, 11 healthy, 14 overweight, 42 obese patients in a given dataset.

Underweight patients comprise 3%
Healthy patients comprise 16%
Overw

<b> Sources for categorizing patient data:</b>

1.[Age groups defined by Centers for Medicare and Medicaid Services](https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/NationalHealthExpendData/Age-and-Gender) <br>
2.[BMI ranges defined by Centers for Disease Control and Prevention](https://www.cdc.gov/healthyweight/assessing/index.html)