# Introduction

When working with real data, we often have to "clean" or manage the data a little before we can perform our analysis. For us, basic data hygiene will consist of:
* Renaming variables
* Handling non-responses
* Managing time consistently (time can be complicated)

To begin, as we always will, we import the packages we need and give them short names.

# Header Block

In [1]:
# This header will be the same no matter what code you are using
# import modules that we will use multiple functions from and give them short names

import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt


# import single functions

from scipy.stats.contingency import chi2_contingency
from itertools import combinations
from statsmodels.graphics.mosaicplot import mosaic
from scipy.stats import pearsonr


# Data Management Block


In [2]:
# This is the URL to a copy of the ADD Health Wave IV Data

addhealth_url = 'https://drive.google.com/uc?export=download&id=1LOoZl4utpqTfKjj6nu70RH16frFLyPfm'


# Here is where we list the variables that we will use with their names

myData = pd.read_csv(addhealth_url,usecols=['H4TO3','H4ID5C'],low_memory=False)

# Rename the variables 

myData.rename(columns={
    'H4TO3':'regular_smoker',
    'H4ID5C':'high_bp', 
},inplace=True)

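# manage non-responses and legitimate skips
# (the result is assigned back to the column; calling replace(..., inplace=True)
# on a selected column is chained assignment, which recent pandas versions warn against)

myData['regular_smoker'] = myData['regular_smoker'].replace({
     6:np.nan,
     7:0,
     8:np.nan,
})

myData['high_bp'] = myData['high_bp'].replace({
    6:np.nan,
    8:np.nan,
})

# add text labels 

myData['regular_smoker'] = myData['regular_smoker'].replace({
    0:'No',
    1:'Yes'
})

myData['high_bp'] = myData['high_bp'].replace({
    0:'No',
    1:'Yes'
})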
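
The recoding pattern used above can be tried on a small made-up Series before running it on the full data set. The sketch below is an illustration only: the values in `toy` are invented, and the meaning of the codes (6 and 8 as non-responses, 7 as a legitimate skip counted as "No") is read off the replacement dictionaries above rather than from the codebook.

```python
import numpy as np
import pandas as pd

# Invented values for illustration; not taken from the Add Health file
toy = pd.Series([0, 1, 7, 6, 8, 1], name='regular_smoker')

# Step 1: non-responses (6, 8) become missing, legitimate skips (7) become 0
recoded = toy.replace({6: np.nan, 7: 0, 8: np.nan})

# Step 2: swap the remaining numeric codes for text labels
labeled = recoded.replace({0: 'No', 1: 'Yes'})

print(labeled.tolist())
```

Doing the numeric recode first matters here: once 7 has been turned into 0, the labeling step only has to handle the two codes 0 and 1.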

Taking a look at the data, we see that both variables now show text labels, even though the raw file encodes these categorical variables as numbers. We will look at what those codes mean later.

In [3]:
myData

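|      | high_bp | regular_smoker |
|------|---------|----------------|
| 0    | Yes     | Yes            |
| 1    | No      | No             |
| 2    | No      | No             |
| 3    | No      | No             |
| 4    | No      | No             |
| ...  | ...     | ...            |
| 5109 | No      | No             |
| 5110 | No      | Yes            |
| 5111 | No      | No             |
| 5112 | No      | Yes            |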


In [4]:
myData['high_bp'].value_counts()

No     4558
Yes     554
Name: high_bp, dtype: int64

In [5]:
myData['regular_smoker'].value_counts()

No     2798
Yes    2312
Name: regular_smoker, dtype: int64
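
By default `value_counts()` leaves missing values out, so the counts above do not show how many responses were recoded to `NaN`. Here is a minimal sketch, using an invented Series rather than the Add Health data, of how `dropna=False` and `normalize=True` change the tally:

```python
import numpy as np
import pandas as pd

# Invented values for illustration; not taken from the Add Health file
toy = pd.Series(['No', 'Yes', 'No', np.nan, 'No', np.nan], name='high_bp')

# Default behavior: missing values are silently excluded
counts_default = toy.value_counts()

# dropna=False adds a row for the missing values
counts_all = toy.value_counts(dropna=False)

# normalize=True reports proportions of the non-missing responses
shares = toy.value_counts(normalize=True)

print(counts_all)
print(shares)
```

On the real data the same idea would be `myData['high_bp'].value_counts(dropna=False)`.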