# Assignment 2
For this assignment we'll be looking at 2017 data on immunizations from the CDC. Our datafile for this assignment is in [Datasets/NISPUF17.csv](Datasets/NISPUF17.csv). A data user's guide, which we'll need to map the variables in the data to the questions being asked, is available at [Datasets/NIS-PUF17-DUG.pdf](Datasets/NIS-PUF17-DUG.pdf).

## Question 1
Here we are going to write a function called `proportion_of_education` which returns the proportion of children in the dataset who had a mother with the education levels equal to less than high school (<12), high school (12), more than high school but not a college graduate (>12) and college degree.

*This function should return a dictionary in the form of (use the correct numbers, do not round numbers):* 
```
    {"less than high school":0.2,
    "high school":0.4,
    "more than high school but not college":0.2,
    "college":0.2}
```

In [1]:
# let's start by bringing in pandas and numpy
import pandas as pd
import numpy as np

In [2]:
def proportion_of_education():
    
    df = pd.read_csv('Datasets/NISPUF17.csv', index_col=0)
    # According to NIS-PUF17-DUG.pdf, the column 'EDUC1' has the information related to the mothers' 
    # educational level
    df = df['EDUC1']
    
    survey = {"less than high school":0.0,
              "high school":0.0,
              "more than high school but not college":0.0,
              "college":0.0}
    
    n = len(df)
    
    # Here 1 represents mothers whose educational level is less than high school, 2 those who completed
    # high school, and so on
    survey['less than high school'] = np.sum(df==1)/n
    survey['high school'] = np.sum(df==2)/n
    survey['more than high school but not college'] = np.sum(df==3)/n
    survey['college'] = np.sum(df==4)/n
    
    return survey

assert type(proportion_of_education())==type({}), "We must return a dictionary."
assert len(proportion_of_education()) == 4, "We have not returned a dictionary with four items in it."
assert "less than high school" in proportion_of_education().keys(), "We have not returned a dictionary with the correct keys."
assert "high school" in proportion_of_education().keys(), "We have not returned a dictionary with the correct keys."
assert "more than high school but not college" in proportion_of_education().keys(), "We have not returned a dictionary with the correct keys."
assert "college" in proportion_of_education().keys(), "We have not returned a dictionary with the correct keys."

In [3]:
proportion_of_education()

{'less than high school': 0.10202002459160373,
 'high school': 0.172352011241876,
 'more than high school but not college': 0.24588090637625154,
 'college': 0.47974705779026877}
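
The same proportions can also be obtained with `Series.value_counts(normalize=True)`. The sketch below is only an illustration on a small made-up series, not the real dataset; the helper name `education_proportions` and the `LABELS` mapping are assumptions introduced here. Note that `value_counts` ignores missing values by default, so the result matches the cell above only when `EDUC1` has no `NaN` entries.

```python
import pandas as pd

# Hypothetical mapping from EDUC1 codes to the dictionary keys used in Question 1.
LABELS = {
    1: "less than high school",
    2: "high school",
    3: "more than high school but not college",
    4: "college",
}


def education_proportions(educ):
    """Return the share of each education code in the series `educ`."""
    # normalize=True divides each count by the number of non-missing values
    shares = educ.value_counts(normalize=True)
    # codes that never occur get a share of 0.0
    return {label: float(shares.get(code, 0.0)) for code, label in LABELS.items()}


# Made-up data: one 1, three 2s, two 3s, four 4s
toy = pd.Series([1, 2, 2, 3, 4, 4, 4, 4, 3, 2])
toy_result = education_proportions(toy)
print(toy_result)
```

On the real data, the call would presumably be `education_proportions(df['EDUC1'])`.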

## Question 2
Let's explore the relationship between being fed breastmilk as a child and getting a seasonal influenza vaccine from a healthcare provider. Return a tuple of the average number of influenza vaccines for those children we know received breastmilk as a child and those we know did not.

*This function should return a tuple in the form (use the correct numbers):*
```
(2.5, 0.1)
```

In [4]:
def average_influenza_doses():
    
    df = pd.read_csv('Datasets/NISPUF17.csv', index_col=0)
    # CBF_01 records whether a child was ever fed breast milk (1 = yes, 2 = no); any other code means the
    # answer is unknown or missing.
    # P_NUMFLU gives the total number of seasonal influenza doses.
    df1 = df[['CBF_01', 'P_NUMFLU']]
    df1 = df1.dropna() # let's drop the NaN values
    
    # here we select the children who were fed breast milk
    fed = df1[(df1['CBF_01']==1) & (df1['P_NUMFLU']>=0)]
    fed = np.array(fed['P_NUMFLU'])
    fed_av = np.sum(fed)/len(fed)
    
    # and here we select the children who were NOT fed breast milk
    no_fed = df1[(df1['CBF_01']==2) & (df1['P_NUMFLU']>=0)]
    no_fed = np.array(no_fed['P_NUMFLU'])
    no_fed_av = np.sum(no_fed)/len(no_fed)
    
    # Finally we get the mean value of the number of influenza vaccines for each group we consider
    average = (fed_av, no_fed_av)
    
    return average

assert len(average_influenza_doses())==2, "Return two values in a tuple, the first for yes and the second for no."

In [5]:
average_influenza_doses()

(1.8799187420058687, 1.5963945918878317)
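
The two group averages can also be computed in one pass with `groupby`. The following is a minimal sketch on invented rows, not the survey data; the helper name `mean_doses_by_breastfeeding` is hypothetical, and the code `99` in the toy frame merely stands in for "any answer other than yes or no".

```python
import pandas as pd


def mean_doses_by_breastfeeding(frame):
    """Return (mean doses for CBF_01 == 1, mean doses for CBF_01 == 2)."""
    # keep only definite yes/no answers and rows with a known dose count
    known = frame[frame["CBF_01"].isin([1, 2])].dropna(subset=["P_NUMFLU"])
    means = known.groupby("CBF_01")["P_NUMFLU"].mean()
    return float(means.loc[1]), float(means.loc[2])


# Made-up data: one row has a missing dose count, one row has a non yes/no answer
toy = pd.DataFrame({
    "CBF_01":   [1, 1, 1,    2, 2, 99, 1],
    "P_NUMFLU": [2, 3, None, 1, 2, 4,  1],
})
toy_means = mean_doses_by_breastfeeding(toy)
print(toy_means)
```

Here the breastfed group keeps the dose counts 2, 3 and 1, and the other group keeps 1 and 2; the row with the missing count and the row coded `99` are dropped before averaging.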

## Question 3
It would be interesting to see if there is any evidence of a link between vaccine effectiveness and sex of the child. Calculate the ratio of the number of children who contracted chickenpox but were vaccinated against it (at least one varicella dose) versus those who were vaccinated but did not contract chickenpox. Return results by sex. 

*This function should return a dictionary in the form of (use the correct numbers):* 
```
    {"male":0.2,
    "female":0.4}
```

In [6]:
def chickenpox_by_sex():
    
    df = pd.read_csv('Datasets/NISPUF17.csv', index_col=0)
    # In HAD_CPOX, 1 marks children who contracted chickenpox and 2 those who did not
    df1 = df[['SEX', 'P_NUMVRC', 'HAD_CPOX']]
    
    # Males who received one or more doses and contracted chickenpox
    male_1 = df1[(df1['SEX']==1) & (df1['P_NUMVRC']>=1) & (df1['HAD_CPOX']==1)]
    male_1 = len(male_1['SEX'])
    # Males who received one or more doses and did NOT contract chickenpox
    male_2 = df1[(df1['SEX']==1) & (df1['P_NUMVRC']>=1) & (df1['HAD_CPOX']==2)]
    male_2 = len(male_2['SEX'])
    # The same two counts for females
    female_1 = df1[(df1['SEX']==2) & (df1['P_NUMVRC']>=1) & (df1['HAD_CPOX']==1)]
    female_1 = len(female_1['SEX'])
    
    female_2 = df1[(df1['SEX']==2) & (df1['P_NUMVRC']>=1) & (df1['HAD_CPOX']==2)]
    female_2 = len(female_2['SEX'])
    # finally, compute the ratio of those who had chickenpox to those who did not, for males and females
    result = {"male": male_1/male_2,
              "female": female_1/female_2}
    
    return result

assert len(chickenpox_by_sex())==2, "Return a dictionary with two items, the first for males and the second for females."

In [7]:
chickenpox_by_sex()

{'male': 0.009675583380762664, 'female': 0.0077918259335489565}
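
The four counts behind these ratios can be tabulated at once with `pd.crosstab`. This is a sketch under stated assumptions: the helper name `ratio_by_sex` is hypothetical, the rows are invented, and the coding (`SEX` 1 = male, 2 = female; `HAD_CPOX` 1 = yes, 2 = no) follows the comments in the cell above.

```python
import pandas as pd


def ratio_by_sex(frame):
    """Ratio of vaccinated children who had chickenpox to those who did not, by sex."""
    # vaccinated = at least one varicella dose; keep only definite yes/no answers
    vaccinated = frame[(frame["P_NUMVRC"] >= 1) & frame["HAD_CPOX"].isin([1, 2])]
    # rows: SEX, columns: HAD_CPOX, cells: number of children
    counts = pd.crosstab(vaccinated["SEX"], vaccinated["HAD_CPOX"])
    ratios = counts[1] / counts[2]
    return {"male": float(ratios.loc[1]), "female": float(ratios.loc[2])}


# Made-up data: the last two rows are excluded (zero doses, unknown doses)
toy = pd.DataFrame({
    "SEX":      [1, 1, 1, 1, 2, 2, 2, 2, 1, 2],
    "P_NUMVRC": [1, 2, 1, 1, 1, 1, 2, 1, 0, None],
    "HAD_CPOX": [1, 2, 2, 2, 1, 1, 2, 2, 1, 2],
})
toy_ratios = ratio_by_sex(toy)
print(toy_ratios)
```

In the toy frame, one vaccinated male had chickenpox and three did not, while two vaccinated females had it and two did not, so the ratios are 1/3 and 1.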

## Question 4
If we wanted to know if vaccines work, we might look at the correlation between the use of the vaccine and whether it results in prevention of the infection or disease [1]. In this question, you are to see if there is a correlation between having had chickenpox and the number of chickenpox (varicella) vaccine doses given.

A positive correlation (e.g., `corr > 0`) means that an increase in `had_chickenpox_column` (which means more no’s) would also increase the values of `num_chickenpox_vaccine_column` (which means more doses of vaccine). If there is a negative correlation (e.g., `corr < 0`), it indicates that having had chickenpox is related to an increase in the number of vaccine doses.

Also, `pval` is the probability of observing a correlation between `had_chickenpox_column` and `num_chickenpox_vaccine_column` at least as strong as the one measured if the two were in fact uncorrelated. A small `pval` means that the observed correlation is highly unlikely to have occurred by chance.
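
To make the sign convention concrete, the Pearson coefficient can be computed by hand on a few invented values. This is only an illustrative sketch: the helper `pearson_r` and the toy lists are assumptions, and the chickenpox column uses the coding described above (1 = yes, 2 = no).

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


# Made-up data in which the "no" answers (2) tend to go with more doses
had_chickenpox = [1, 2, 2, 2, 1, 2, 2, 1]
num_doses = [0, 2, 1, 2, 1, 2, 1, 0]

r = pearson_r(had_chickenpox, num_doses)
print(r)
```

Because the children coded 2 (no chickenpox) mostly have more doses in this toy sample, the coefficient comes out positive, which is the case described in the first sentence of the explanation above.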

In [8]:
# to deal with correlation topics we bring in scipy
import scipy.stats as stats

In [9]:
def corr_chickenpox():
    
    df = pd.read_csv('Datasets/NISPUF17.csv', index_col=0)
    df = df[['P_NUMVRC','HAD_CPOX']]
    df = df.dropna()
    # To compute the correlation, keep only the definite answers in HAD_CPOX (1 = yes, 2 = no)
    df = df[df['HAD_CPOX']<=2]
    
    # let's build a DataFrame with the column names used in the question text
    df1 = pd.DataFrame({"had_chickenpox_column": df['HAD_CPOX'],
                        "num_chickenpox_vaccine_column": df['P_NUMVRC']})

    # here is some stub code to actually run the correlation
    corr, pval = stats.pearsonr(df1["had_chickenpox_column"],df1["num_chickenpox_vaccine_column"])
    
    return corr, pval

corr, pval = corr_chickenpox()
assert -1<= corr <=1, "You must return a float number between -1.0 and 1.0."

In [10]:
# pval should be very small
corr_chickenpox()

(0.07044873460148016, 2.7780263182891086e-18)