
# Basic statistics with pandas
We use 2017 data on immunizations from the Centers for Disease Control and Prevention (CDC).

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
import numpy as np

The data were provided by U-M. The file contains 28,465 rows and 454 columns.

In [6]:
df = pd.read_csv("/content/drive/MyDrive/bases de datos michigan/NISPUF17.csv")
df.shape

(28465, 454)

In [7]:
df.head(3)

Unnamed: 0.1,Unnamed: 0,SEQNUMC,SEQNUMHH,PDAT,PROVWT_D,RDDWT_D,STRATUM,YEAR,AGECPOXR,HAD_CPOX,...,XVRCTY2,XVRCTY3,XVRCTY4,XVRCTY5,XVRCTY6,XVRCTY7,XVRCTY8,XVRCTY9,INS_STAT2_I,INS_BREAK_I
0,1,128521,12852,2,,235.916956,1031,2017,,2,...,,,,,,,,,,
1,2,10741,1074,2,,957.35384,1068,2017,,2,...,,,,,,,,,,
2,3,220011,22001,2,,189.611299,1050,2017,,2,...,,,,,,,,,,


## Filtering data and calculating frequencies
The function `proportion_of_education` returns the proportion of children in the dataset whose mother's education level falls in each of four categories: less than high school, high school, more than high school but not a college graduate, and college degree.

In [9]:
# The column we are interested in is 'EDUC1'.

# Its categories are coded 1, 2, 3 and 4, which
# correspond to less than high school, high school,
# more than high school but not a college graduate,
# and college degree, respectively.

# There are no missing values in this column.
def proportion_of_education():
  le_12 = df[df['EDUC1']==1]['EDUC1']
  eq_12 = df[df['EDUC1']==2]['EDUC1']
  gr_12 = df[df['EDUC1']==3]['EDUC1']
  college = df[df['EDUC1']==4]['EDUC1']
  
  le_12_rf = len(le_12)/len(df)
  eq_12_rf = len(eq_12)/len(df)
  gr_12_rf = len(gr_12)/len(df)
  college_rf = len(college)/len(df)
  return {"less than high school":le_12_rf,"high school":eq_12_rf,"more than high school but not college":gr_12_rf,"college":college_rf}

proportion_of_education()

{'college': 0.47974705779026877,
 'high school': 0.172352011241876,
 'less than high school': 0.10202002459160373,
 'more than high school but not college': 0.24588090637625154}
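
The same relative frequencies can also be obtained in one step with `value_counts(normalize=True)`. The sketch below is only an illustration: it uses a small made-up DataFrame (`toy`) instead of the CDC file, and the `labels` dictionary is a helper introduced here. Note that `value_counts` divides by the number of non-missing values, which matches dividing by `len(df)` only when the column has no missing values, as is the case for `EDUC1`.

```python
import pandas as pd

# Hypothetical toy data standing in for the EDUC1 column (codes 1 to 4)
toy = pd.DataFrame({"EDUC1": [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]})

# Readable labels for the four codes described above
labels = {
    1: "less than high school",
    2: "high school",
    3: "more than high school but not college",
    4: "college",
}

# normalize=True returns relative frequencies instead of raw counts
proportions = (
    toy["EDUC1"]
    .map(labels)
    .value_counts(normalize=True)
    .to_dict()
)
print(proportions)
```
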

## Relationship between two variables

We explore the relationship between being fed breastmilk as a child and getting a seasonal influenza vaccine from a healthcare provider. 

In [11]:
# The column 'CBF_01' records whether the child
# was breastfed: 1 = yes, 2 = no.

# The column 'P_NUMFLU' contains the number of
# flu vaccine doses.

# 'P_NUMFLU' has some NaN values. We drop them
# before calculating the average.

def average_influenza_doses():
  BF = df[df['CBF_01']==1]
  NBF = df[df['CBF_01']==2]
  BF_FLU = np.sum(BF['P_NUMFLU'])/len(BF['P_NUMFLU'].dropna())
  NBF_FLU = np.sum(NBF['P_NUMFLU'])/len(NBF['P_NUMFLU'].dropna())
  return (BF_FLU,NBF_FLU)
average_influenza_doses()

(1.8799187420058687, 1.5963945918878317)
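
As an alternative sketch of the same calculation, `groupby` followed by `mean` computes one average per group and skips NaN values by default. The DataFrame `toy` below is made-up data that only reuses the column names and codes from the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: CBF_01 is 1 (breastfed) or 2 (not breastfed)
toy = pd.DataFrame({
    "CBF_01": [1, 1, 1, 2, 2, 2],
    "P_NUMFLU": [2.0, 3.0, np.nan, 1.0, np.nan, 2.0],
})

# mean() ignores NaN, so no explicit dropna() is needed
avg = toy.groupby("CBF_01")["P_NUMFLU"].mean()
breastfed, not_breastfed = avg.loc[1], avg.loc[2]
print(breastfed, not_breastfed)
```
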

It would be interesting to see whether there is any evidence of a link between vaccine effectiveness and the sex of the child. For each sex, we calculate the ratio of the number of children who were vaccinated against chickenpox (at least one varicella dose) and still contracted it to the number who were vaccinated and did not contract it.

In [12]:
# Column "P_NUMVRC" has the number of chickenpox vaccine doses.
# As used below: HAD_CPOX is 1 = yes, 2 = no; SEX is 1 = male, 2 = female.
def chickenpox_by_sex():
  var = df[df["P_NUMVRC"]>=1]
  yes_var_female = len(var[(var['HAD_CPOX']==1)&(var['SEX']==2)])
  no_var_female = len(var[(var['HAD_CPOX']==2)&(var['SEX']==2)])
  yes_var_male = len(var[(var['HAD_CPOX']==1)&(var['SEX']==1)])
  no_var_male = len(var[(var['HAD_CPOX']==2)&(var['SEX']==1)])
  return {'male':yes_var_male/no_var_male,'female':yes_var_female/no_var_female}
chickenpox_by_sex()

{'female': 0.0077918259335489565, 'male': 0.009675583380762664}
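
The four counts used above form a 2x2 table, so one possible shortcut is `pd.crosstab`. The following is a minimal sketch on made-up data (`toy`), assuming the same coding as in the cell above (`HAD_CPOX`: 1 = yes, 2 = no; `SEX`: 1 = male, 2 = female).

```python
import pandas as pd

# Hypothetical toy data reusing the column names from the cell above
toy = pd.DataFrame({
    "P_NUMVRC": [1, 2, 1, 0, 1, 1, 2, 1],
    "HAD_CPOX": [1, 2, 2, 1, 1, 2, 2, 2],
    "SEX":      [1, 1, 1, 1, 2, 2, 2, 2],
})

# Keep only children with at least one varicella dose
vaccinated = toy[toy["P_NUMVRC"] >= 1]

# Rows are SEX, columns are HAD_CPOX; each cell is a count of children
counts = pd.crosstab(vaccinated["SEX"], vaccinated["HAD_CPOX"])

# Ratio of "had chickenpox" (column 1) to "did not" (column 2) for each sex
ratio = counts[1] / counts[2]
result = {"male": ratio.loc[1], "female": ratio.loc[2]}
print(result)
```
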

## Correlation
We want to know whether chickenpox vaccines work, so we calculate the correlation between the number of vaccine doses received (`P_NUMVRC`) and whether the child had chickenpox (`HAD_CPOX`). This analysis is not rigorous; it only shows how to handle datasets with pandas. Because `HAD_CPOX` is coded 1 for yes and 2 for no, a positive coefficient means that more doses go with fewer reported cases.

In [14]:
# Column "P_NUMVRC" has some NaN
import scipy.stats as stats

def corr_chickenpox():

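  # Keep only the two columns of interest
  DF = pd.DataFrame(df['P_NUMVRC'])
  DF['HAD_CPOX'] = df['HAD_CPOX']
  # Keep definite answers only (1 = yes, 2 = no) and drop rows with NaN
  DF = DF[DF['HAD_CPOX'].isin([1, 2])]
  DF = DF.dropna()
  corr, pval = stats.pearsonr(DF['HAD_CPOX'], DF['P_NUMVRC'])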

  return corr
corr_chickenpox()

0.07044873460148046
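
For readers who only need the coefficient and not the p-value, pandas can compute the Pearson correlation directly with `Series.corr`. The sketch below uses made-up data (`toy`); the value 77 is a hypothetical placeholder for an answer other than yes or no, included only to show the effect of the `isin` filter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data reusing the column names from the cell above
toy = pd.DataFrame({
    "P_NUMVRC": [0, 0, 1, 1, 2, 2, np.nan, 1],
    "HAD_CPOX": [1, 1, 1, 2, 2, 2, 2, 77],
})

# Keep definite answers (1 = yes, 2 = no) and drop rows with missing doses
clean = toy[toy["HAD_CPOX"].isin([1, 2])].dropna()

# Series.corr uses the Pearson method by default
r = clean["P_NUMVRC"].corr(clean["HAD_CPOX"])
print(r)
```

In this toy example the coefficient is positive, which, given the 1 = yes, 2 = no coding, means that children with more doses were less likely to have had chickenpox.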