# Data Exploration: National Immunization Survey-Child 2017 Dataset
For this assignment I worked with 2017 immunization data from the CDC. The data file is in [assets/NISPUF17.csv](assets/NISPUF17.csv). The data user's guide, which I used to map the variables in the data to the survey questions, is available at [assets/NIS-PUF17-DUG.pdf](assets/NIS-PUF17-DUG.pdf).

## Part A
Question: Write a function called `proportion_of_education` which returns the proportion of children in the dataset whose mother's education level was less than high school (<12), high school (12), more than high school but not a college graduate (>12), or a college degree.

*This function should return a dictionary in the form of (use the correct numbers, do not round numbers):* 
```
    {"less than high school":0.2,
    "high school":0.4,
    "more than high school but not college":0.2,
    "college":0.2}
```


In [1]:
import pandas as pd

def proportion_of_education():
    df = pd.read_csv("assets/NISPUF17.csv", index_col=0)
    df = df[['SEQNUMC','EDUC1']]

    result = {}
    keys = ["less than high school", "high school", "more than high school but not college", "college"]


    for i in range(1,len(keys)+1):
        proportion = (df['EDUC1'] == i).sum() / len(df)  # Proportion of all rows with EDUC1 code i
        key = keys[i-1]
        result[key] = proportion       # Storing Result in the dictionary
        
    return result


In [2]:
proportion_of_education()

{'less than high school': 0.10202002459160373,
 'high school': 0.172352011241876,
 'more than high school but not college': 0.24588090637625154,
 'college': 0.47974705779026877}
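
As a cross-check, the same kind of proportions can be computed with `value_counts(normalize=True)`. The sketch below runs on a small made-up frame rather than the survey file; the helper name `education_proportions` and the toy values are illustrative assumptions, and the `EDUC1` codes 1 to 4 follow the ordering used in the function above.

```python
import pandas as pd

# Labels for EDUC1 codes 1-4, in the same order as the keys list above.
EDUC_LABELS = {
    1: "less than high school",
    2: "high school",
    3: "more than high school but not college",
    4: "college",
}

def education_proportions(df):
    """Hypothetical helper: share of rows for each EDUC1 code."""
    shares = df["EDUC1"].value_counts(normalize=True)
    # .get() returns 0.0 for a code that never appears in the data.
    return {label: float(shares.get(code, 0.0)) for code, label in EDUC_LABELS.items()}

# Toy data (not from the survey): ten made-up rows.
toy = pd.DataFrame({"EDUC1": [1, 2, 2, 3, 3, 4, 4, 4, 4, 4]})
proportions = education_proportions(toy)
print(proportions)
```

Note that `value_counts(normalize=True)` divides by the number of non-missing values, so it matches dividing by `len(df)` only when `EDUC1` has no missing entries.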

## Part B

Let's explore the relationship between being fed breastmilk as a child and getting a seasonal influenza vaccine from a healthcare provider.
Question: Return a tuple of the average number of influenza vaccines for those children we know received breastmilk as a child and those we know did not.

*This function should return a tuple in the form (use the correct numbers):*
```
(2.5, 0.1)
```

In [3]:
import pandas as pd

def average_influenza_doses():
    df = pd.read_csv("assets/NISPUF17.csv", index_col=0)
    df = df[['CBF_01', 'P_NUMFLU']].dropna()

    breastfed = df[df['CBF_01']==1]['P_NUMFLU']      # children known to have received breastmilk
    not_breastfed = df[df['CBF_01']==2]['P_NUMFLU']  # children known not to have received it
    result = (breastfed.mean(), not_breastfed.mean())

    return result


In [4]:
average_influenza_doses()

(1.8799187420058687, 1.5963945918878317)
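
The two averages can also be computed in one pass with `groupby`. This is a minimal sketch on made-up data, not the survey file; the helper name `mean_flu_doses_by_breastfed` and the toy values are assumptions for illustration, with `CBF_01` coded 1 and 2 as in the function above.

```python
import pandas as pd

def mean_flu_doses_by_breastfed(df):
    """Hypothetical helper: mean P_NUMFLU for CBF_01 == 1 and CBF_01 == 2."""
    clean = df[["CBF_01", "P_NUMFLU"]].dropna()
    means = clean.groupby("CBF_01")["P_NUMFLU"].mean()
    return (float(means.loc[1]), float(means.loc[2]))

# Toy data (not from the survey); one row has a missing dose count.
toy = pd.DataFrame({
    "CBF_01":   [1, 1, 1, 2, 2, 1, 2],
    "P_NUMFLU": [2, 3, 1, 0, 2, None, 1],
})
toy_means = mean_flu_doses_by_breastfed(toy)
print(toy_means)
```

Grouping once avoids filtering the frame separately for each group, and rows with a missing dose count are dropped before averaging, as in the original function.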

## Part C
It would be interesting to see if there is any evidence of a link between vaccine effectiveness and sex of the child.
Question: Calculate the ratio of the number of children who contracted chickenpox despite being vaccinated against it (at least one varicella dose) to the number who were vaccinated and did not contract chickenpox. Return the results by sex.

*This function should return a dictionary in the form of (use the correct numbers):* 
```
    {"male":0.2,
    "female":0.4}
```


In [5]:
import pandas as pd

def chickenpox_by_sex():
    df = pd.read_csv("assets/NISPUF17.csv", index_col=0)
    df = df[['SEX', 'HAD_CPOX', 'P_NUMVRC']].dropna()

    vaccinated = df[df['P_NUMVRC']>0]  # at least one varicella dose

    # len() counts the matching rows directly
    male_ratio = len(vaccinated[(vaccinated['SEX']==1) & (vaccinated['HAD_CPOX']==1)]) / len(vaccinated[(vaccinated['SEX']==1) & (vaccinated['HAD_CPOX']==2)])

    female_ratio = len(vaccinated[(vaccinated['SEX']==2) & (vaccinated['HAD_CPOX']==1)]) / len(vaccinated[(vaccinated['SEX']==2) & (vaccinated['HAD_CPOX']==2)])

    result = {"male": male_ratio, "female": female_ratio }
    return result



In [6]:
chickenpox_by_sex()

{'male': 0.009675583380762664, 'female': 0.0077918259335489565}
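
The four counts behind these ratios can be laid out in a single table with `pd.crosstab`. The sketch below uses made-up rows, not the survey file; the helper name `cpox_ratio_by_sex` and the toy values are illustrative assumptions, and the codes (`SEX` 1/2, `HAD_CPOX` 1/2) follow the function above.

```python
import pandas as pd

def cpox_ratio_by_sex(df):
    """Hypothetical helper: (vaccinated, had chickenpox) / (vaccinated, did not), per sex."""
    vaccinated = df[df["P_NUMVRC"] > 0]  # at least one varicella dose
    # Rows are SEX codes, columns are HAD_CPOX codes, cells are counts.
    counts = pd.crosstab(vaccinated["SEX"], vaccinated["HAD_CPOX"])
    ratio = counts[1] / counts[2]
    return {"male": float(ratio.loc[1]), "female": float(ratio.loc[2])}

# Toy data (not from the survey).
toy = pd.DataFrame({
    "SEX":      [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "HAD_CPOX": [1, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2],
    "P_NUMVRC": [1, 1, 2, 1, 0, 2, 1, 1, 0, 2, 1],
})
toy_ratios = cpox_ratio_by_sex(toy)
print(toy_ratios)
```

In the toy data, one vaccinated boy out of four and one vaccinated girl out of five had chickenpox, so the ratios are 1/3 and 1/4.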

## Part D
A correlation is a statistical relationship between two variables. If we wanted to know if vaccines work, we might look at the correlation between the use of the vaccine and whether it results in prevention of the infection or disease.
In this question, we check whether there is a correlation between having had chickenpox and the number of chickenpox (varicella) vaccine doses given.

Some notes on interpreting the answer: the `had_chickenpox_column` is either `1` (for yes) or `2` (for no), and the `num_chickenpox_vaccine_column` is the number of doses of the varicella vaccine a child has been given. A positive correlation (i.e., `corr > 0`) means that higher values of `had_chickenpox_column` (more no's) go together with higher values of `num_chickenpox_vaccine_column` (more doses of vaccine). A negative correlation (i.e., `corr < 0`) indicates that having had chickenpox is associated with a higher number of vaccine doses.
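
Because the sign depends on the 1 = yes, 2 = no coding, a tiny made-up example may help. In the sketch below the column names and values are invented for illustration; it shows that recoding the same answers as 1 = had chickenpox, 0 = did not flips the sign of the Pearson correlation without changing its size.

```python
import pandas as pd

# Toy data (not from the survey): children without chickenpox have more doses.
toy = pd.DataFrame({
    "had_chickenpox": [1, 1, 1, 2, 2, 2],  # 1 = yes, 2 = no
    "doses":          [0, 1, 0, 2, 1, 2],
})

# Series.corr uses the Pearson method by default.
corr_survey_coding = toy["had_chickenpox"].corr(toy["doses"])

# Same answers recoded so that a larger value means "had chickenpox".
had_it = (toy["had_chickenpox"] == 1).astype(int)
corr_recoded = had_it.corr(toy["doses"])

print(corr_survey_coding, corr_recoded)
```

With the survey coding the toy correlation is positive (more no's go with more doses); with the recoded column it is negative by the same amount.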


In [7]:
def corr_chickenpox():
    import scipy.stats as stats
    import pandas as pd
    
    df = pd.read_csv("assets/NISPUF17.csv", index_col=0)
    
    df = df[['HAD_CPOX', 'P_NUMVRC']].dropna()
    df = df[df['HAD_CPOX'].isin([1, 2])]  # keep only yes (1) / no (2) answers; drops other codes such as 77
    

    corr, pval = stats.pearsonr(df['HAD_CPOX'], df['P_NUMVRC'])
    
    return corr


In [8]:
corr_chickenpox()

0.07044873460147855