Setting up a fancy stylesheet

In [1]:
from IPython.display import HTML
css_file = 'style.css'
# Read the stylesheet and close the file handle before rendering it
with open(css_file, 'r') as f:
    css = f.read()
HTML(css)

Setting up the required Python&#8482; environment

In [2]:
import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import bayes_mvs
from math import factorial
import scikits.bootstrap as bs
import matplotlib.pyplot as plt
import seaborn as sns
from warnings import filterwarnings
%matplotlib inline
sns.set_style('whitegrid')
sns.set_context('paper', font_scale = 2.0, rc = {'lines.linewidth': 1.5, 'figure.figsize' : (10, 8)})
filterwarnings('ignore')

# Comparing categorical data

## Introduction

<p>In the previous chapter we created two groups.  These groups were categorical in nature.  Group A was those without histological evidence of appendicitis and group B included those with evidence of appendicitis.  Furthermore, these two groups are nominal.  There is no order to with and without appendicitis.</p>
<p>The variable we examined was white cell count.  This represented continuous data of the ratio numerical type.  Now, let's construct two groups, once again of the nominal categorical type, but this time the data variable is also categorical.</p>

## Importing and examining the dataset

In [3]:
data = pd.read_csv('MOOC_Mock.csv')
data.head()

Unnamed: 0,File,Age,Gender,Delay,Stay,ICU,RVD,CD4,HR,Temp,CRP,WCC,HB,Rupture,Histo,Comp,MASS
0,1,38,Female,3,6,No,No,,97,35.2,,10.49,10.4,No,Yes,Yes,5
1,2,32,Male,6,10,No,Yes,57.0,109,38.8,45.3,7.08,19.8,No,No,Yes,8
2,3,19,Female,1,16,No,No,,120,36.3,10.7,13.0,8.7,No,No,No,3
3,4,20,Female,2,9,No,Yes,,120,35.7,77.8,4.45,8.8,No,No,No,0
4,5,28,Female,3,3,No,Yes,491.0,115,37.1,51.6,21.98,13.4,No,Yes,No,7


## The chi-squared (*&#935;<sup>2</sup>*) test

<p>For this example I want to know if there is a difference in the incidence of histologically proven appendicitis between those with and without retroviral disease (RVD).</p>
<p>Think about it for a moment.  I will have two groups: Group A without appendicitis and Group B with.  Within each of these I will have two subgroups: Group I without RVD and Group II with.  From this I can create a little table called a *contingency table*.  In this example the table will have two rows and two columns.  For a *&#935;<sup>2</sup>* test, the table can have more rows and more columns.</p>
<p>In order to do this test, we will have to get the values to fill in our table manually.</p>

In [4]:
histo_group = data.groupby(data['Histo'])
histo_group['RVD'].value_counts()

Histo  RVD
No     No     16
       Yes    14
Yes    No     80
       Yes    40
dtype: int64

<p>Here, I have made use of the powerful *.groupby()* function.  It splits my DataFrame into groups according to the values found in the specified column (here I chose **Histo**).  I attached the resulting grouped object to a computer variable name *histo_group*.</p>
<p>I then moved on and asked the software to give me the value counts for the **RVD** column of this grouped object.  Note how the results tell me that the data are split by **Histo** and then give me a breakdown of the values found in the requested column.  It found values for *Yes* and *No* and told me how many of each it found.</p>

### Creating the contingency table (matrix)

<p>Now we have to construct our little two-by-two matrix.  Here I will use *numpy*.  Note the use of square brackets.</p>

In [5]:
histo_RVD_observed = np.array([[16, 14], [80, 40]])
histo_RVD_observed

array([[16, 14],
       [80, 40]])
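<p>Typing the counts in by hand works for a small table, but the same matrix can also be built directly from the two columns.  The sketch below is not part of the original analysis: it uses *pd.crosstab()* on a small stand-in DataFrame (called *mock* here, a name of my own) rebuilt from the four counts shown above, since the real file is not available in this example.  With the actual data, the two columns of *data* would be passed instead.</p>

```python
import numpy as np
import pandas as pd

# Stand-in data rebuilt from the counts reported by value_counts() above.
# Keys are (Histo, RVD) pairs; in the real notebook use the `data` DataFrame.
counts = {('No', 'No'): 16, ('No', 'Yes'): 14,
          ('Yes', 'No'): 80, ('Yes', 'Yes'): 40}
rows = [(histo, rvd) for (histo, rvd), n in counts.items() for _ in range(n)]
mock = pd.DataFrame(rows, columns=['Histo', 'RVD'])

# Cross-tabulate: rows are Histo (No, Yes), columns are RVD (No, Yes)
observed_table = pd.crosstab(mock['Histo'], mock['RVD'])

# Convert to a plain numpy array, ready for a chi-squared test
observed_array = observed_table.to_numpy()
print(observed_table)
```

<p>The resulting array should match the hand-typed matrix, which is a useful check against transcription errors.</p>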

<p>From the *scipy.stats* library I am going to use the *chi2_contingency()* function.  It takes my observed table above as its argument and returns four values, hence my use of four computer variable names before the equal sign.  They are:</p>
* The &#935;*<sup>2</sup>* value
* The *p*-value
* The degrees of freedom
* The expected table

In [6]:
chi_val, p, df, expected = stats.chi2_contingency(histo_RVD_observed)

<p>Let's print each of these to the screen.</p>

In [7]:
chi_val

1.3183593750000002

In [8]:
p

0.25088670393543944
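<p>One detail worth knowing about the two values above: for a two-by-two table, *chi2_contingency()* applies Yates' continuity correction by default, which shrinks each difference between observed and expected by 0.5 before squaring.  The sketch below, with variable names of my own choosing, compares the default result with the uncorrected one (*correction=False*) and reproduces both by hand.</p>

```python
import numpy as np
from scipy import stats

observed = np.array([[16, 14], [80, 40]])

# Default call: Yates' continuity correction is applied for a 2x2 table
chi_corrected, p_corrected, _, expected = stats.chi2_contingency(observed)

# Same test without the continuity correction
chi_plain, p_plain, _, _ = stats.chi2_contingency(observed, correction=False)

# Manual versions of both statistics
manual_corrected = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()
manual_plain = ((observed - expected) ** 2 / expected).sum()

print(chi_corrected, p_corrected)
print(chi_plain, p_plain)
```

<p>The corrected statistic is the smaller of the two, so its *p*-value is the larger (more conservative) one.</p>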

In [9]:
df

1
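<p>The degrees of freedom and the *p*-value follow from the shape of the table and the test statistic.  As a rough sketch (not part of the original notebook, and with names of my own), a table with r rows and c columns has (r - 1)(c - 1) degrees of freedom, and the *p*-value is the upper-tail area of the chi-squared distribution beyond the statistic.</p>

```python
import numpy as np
from scipy import stats

observed = np.array([[16, 14], [80, 40]])

# Degrees of freedom from the table shape: (rows - 1) * (columns - 1)
n_rows, n_cols = observed.shape
dof_manual = (n_rows - 1) * (n_cols - 1)

# Values returned by the library call used in the text
chi_val, p, dof, expected = stats.chi2_contingency(observed)

# p-value as the survival function (upper tail) of the chi-squared distribution
p_manual = stats.chi2.sf(chi_val, dof_manual)
print(dof_manual, p_manual)
```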

In [10]:
expected

array([[ 19.2,  10.8],
       [ 76.8,  43.2]])

<p>So, our *p*-value was more than *0.05*, which means we cannot reject the null hypothesis: we found no statistically significant difference in the rate of RVD between those with and without appendicitis.</p>
<p>The expected table is quite interesting.  It shows the count we would have expected in each cell if RVD status and histology were unrelated, given the row and column totals.</p>
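<p>To make the expected table less of a black box, here is a minimal sketch of how it can be reproduced by hand.  Each expected count is the row total multiplied by the column total, divided by the grand total.  The variable names are my own and are not used elsewhere in this chapter.</p>

```python
import numpy as np

observed = np.array([[16, 14], [80, 40]])

row_totals = observed.sum(axis=1)   # totals for Histo = No, Yes
col_totals = observed.sum(axis=0)   # totals for RVD = No, Yes
grand_total = observed.sum()        # total number of patients in the table

# Expected count per cell: (row total * column total) / grand total
expected_manual = np.outer(row_totals, col_totals) / grand_total
print(expected_manual)
```

<p>The result should agree with the *expected* array returned by *chi2_contingency()*.</p>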