# Data Management

This content was created while following the [Statistics with Python Specialization](https://www.coursera.org/specializations/statistics-with-python) from Coursera.
In particular, this document is related to the course within the specialization [Understanding and Visualizing Data with Python](https://www.coursera.org/learn/understanding-visualization-data?specialization=statistics-with-python).

This notebook summarizes the introductory data management lab, which focuses on the most basic `pandas` calls. It also introduces the two datasets used in the course:
- Cartwheel
- [NHANES Dataset](https://wwwn.cdc.gov/nchs/nhanes/Default.aspx): National Health and Nutrition Examination Survey, from the CDC (USA)

The Python/pandas calls used here are very simple, so only a few comments are added.

Overview:
1. Pandas DataFrames: Intro with the Cartwheel Dataset
    - Selection
    - Group By
2. Pandas DataFrames: NHANES Dataset
    - Missing Values
3. Python guidelines
   - Cheat sheets: NumPy, pandas, SciPy, Matplotlib
   - Style guidelines (based on Google)

In [25]:
import numpy as np
import pandas as pd

In [5]:
a = np.array([0,1,2,3,4,5,6,7,8,9,10]) 
np.mean(a)

5.0

## 1. Pandas DataFrames: Intro with the Cartwheel Dataset

In [3]:
# Store the location of our .csv file (here a local path, not the URL used in the video)
url = "Cartwheeldata.csv"

# Read the .csv file and store it as a pandas Data Frame
df = pd.read_csv(url)

# Output object type
type(df)

pandas.core.frame.DataFrame

In [4]:
df.head()

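| | ID | Age | Gender | GenderGroup | Glasses | GlassesGroup | Height | Wingspan | CWDistance | Complete | CompleteGroup | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 56 | F | 1 | Y | 1 | 62.0 | 61.0 | 79 | Y | 1 | 7 |
| 1 | 2 | 26 | F | 1 | Y | 1 | 62.0 | 60.0 | 70 | Y | 1 | 8 |
| 2 | 3 | 33 | F | 1 | Y | 1 | 66.0 | 64.0 | 85 | Y | 1 | 7 |
| 3 | 4 | 39 | F | 1 | N | 0 | 64.0 | 63.0 | 87 | Y | 1 | 10 |
| 4 | 5 | 27 | M | 2 | N | 0 | 73.0 | 75.0 | 72 | N | 0 | 4 |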


In [6]:
df.columns

Index(['ID', 'Age', 'Gender', 'GenderGroup', 'Glasses', 'GlassesGroup',
       'Height', 'Wingspan', 'CWDistance', 'Complete', 'CompleteGroup',
       'Score'],
      dtype='object')

### Selection

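Two main indexers (used with square brackets, not called like functions):
1. `df.loc[]`: label-based selection
2. `df.iloc[]`: integer-position-based selection

The older `df.ix[]` indexer, which mixed both behaviors, was deprecated in pandas 0.20 and removed in pandas 1.0, so it should not be used in new code.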
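
The difference between label-based and position-based selection is easiest to see when the row labels are not simply 0, 1, 2, and so on. The following is a minimal sketch on a toy DataFrame; the values and the index labels 10, 20, 30 are made up for illustration and are not part of the Cartwheel data.

```python
import pandas as pd

# Toy data (hypothetical values, not the Cartwheel dataset)
toy = pd.DataFrame(
    {"Age": [56, 26, 33], "Height": [62.0, 62.0, 66.0]},
    index=[10, 20, 30],  # non-default row labels
)

# loc selects by label: the row whose index label is 20
by_label = toy.loc[20, "Age"]

# iloc selects by integer position: the row at position 1 (the second row)
by_position = toy.iloc[1, 0]

# Mixing a row position with a column label (the kind of lookup older
# code wrote with ix): convert the label to a position, then use iloc
col_pos = toy.columns.get_loc("Height")
mixed = toy.iloc[0, col_pos]

print(by_label, by_position, mixed)
```

Both `by_label` and `by_position` refer to the same cell here, because label 20 happens to sit at position 1.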

In [7]:
# Return all observations of CWDistance
df.loc[:,"CWDistance"]

0      79
1      70
2      85
3      87
4      72
5      81
6     107
7      98
8     106
9      65
10     96
11     79
12     92
13     66
14     72
15    115
16     90
17     74
18     64
19     85
20     66
21    101
22     82
23     63
24     67
Name: CWDistance, dtype: int64

In [8]:
# Select all rows for multiple columns, ["CWDistance", "Height", "Wingspan"]
df.loc[:,["CWDistance", "Height", "Wingspan"]]

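| | CWDistance | Height | Wingspan |
|---|---|---|---|
| 0 | 79 | 62.0 | 61.0 |
| 1 | 70 | 62.0 | 60.0 |
| 2 | 85 | 66.0 | 64.0 |
| 3 | 87 | 64.0 | 63.0 |
| 4 | 72 | 73.0 | 75.0 |
| 5 | 81 | 75.0 | 71.0 |
| 6 | 107 | 75.0 | 76.0 |
| 7 | 98 | 65.0 | 62.0 |
| 8 | 106 | 74.0 | 73.0 |
| 9 | 65 | 63.0 | 60.0 |

Only the first 10 of the 25 rows are shown.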


In [9]:
# Select the first 10 rows (labels 0 to 9, end label included) for multiple columns, ["CWDistance", "Height", "Wingspan"]
df.loc[:9, ["CWDistance", "Height", "Wingspan"]]

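| | CWDistance | Height | Wingspan |
|---|---|---|---|
| 0 | 79 | 62.0 | 61.0 |
| 1 | 70 | 62.0 | 60.0 |
| 2 | 85 | 66.0 | 64.0 |
| 3 | 87 | 64.0 | 63.0 |
| 4 | 72 | 73.0 | 75.0 |
| 5 | 81 | 75.0 | 71.0 |
| 6 | 107 | 75.0 | 76.0 |
| 7 | 98 | 65.0 | 62.0 |
| 8 | 106 | 74.0 | 73.0 |
| 9 | 65 | 63.0 | 60.0 |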


In [10]:
# Select range of rows for all columns
df.loc[10:15]

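| | ID | Age | Gender | GenderGroup | Glasses | GlassesGroup | Height | Wingspan | CWDistance | Complete | CompleteGroup | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 11 | 30 | M | 2 | Y | 1 | 69.5 | 66.0 | 96 | Y | 1 | 6 |
| 11 | 12 | 28 | F | 1 | Y | 1 | 62.75 | 58.0 | 79 | Y | 1 | 10 |
| 12 | 13 | 25 | F | 1 | Y | 1 | 65.0 | 64.5 | 92 | Y | 1 | 6 |
| 13 | 14 | 23 | F | 1 | N | 0 | 61.5 | 57.5 | 66 | Y | 1 | 4 |
| 14 | 15 | 31 | M | 2 | Y | 1 | 73.0 | 74.0 | 72 | Y | 1 | 9 |
| 15 | 16 | 26 | M | 2 | Y | 1 | 71.0 | 72.0 | 115 | Y | 1 | 6 |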


In [11]:
df.loc[:9, "CWDistance"]

0     79
1     70
2     85
3     87
4     72
5     81
6    107
7     98
8    106
9     65
Name: CWDistance, dtype: int64

In [13]:
# First four rows
# iloc is integer-position-based selection: it accepts only integers
# in contrast, loc is label-based for both rows and columns
df.iloc[:4]

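| | ID | Age | Gender | GenderGroup | Glasses | GlassesGroup | Height | Wingspan | CWDistance | Complete | CompleteGroup | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 56 | F | 1 | Y | 1 | 62.0 | 61.0 | 79 | Y | 1 | 7 |
| 1 | 2 | 26 | F | 1 | Y | 1 | 62.0 | 60.0 | 70 | Y | 1 | 8 |
| 2 | 3 | 33 | F | 1 | Y | 1 | 66.0 | 64.0 | 85 | Y | 1 | 7 |
| 3 | 4 | 39 | F | 1 | N | 0 | 64.0 | 63.0 | 87 | Y | 1 | 10 |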


In [14]:
df.iloc[1:5, 2:4]

| | Gender | GenderGroup |
|---|---|---|
| 1 | F | 1 |
| 2 | F | 1 |
| 3 | F | 1 |
| 4 | M | 2 |
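
The row counts in the outputs above follow from how the two indexers treat slice endpoints: `df.loc[:9]` returns 10 rows because label slices include the end label, while `df.iloc[:4]` returns 4 rows because position slices exclude the end position. A minimal sketch, using a toy DataFrame built from the first five `CWDistance` values:

```python
import pandas as pd

# Toy frame with the default row labels 0..4
toy = pd.DataFrame({"CWDistance": [79, 70, 85, 87, 72]})

# loc slices by label and includes the end label: labels 0, 1, 2
n_loc = len(toy.loc[0:2])

# iloc slices by position and excludes the end position: positions 0, 1
n_iloc = len(toy.iloc[0:2])

print(n_loc, n_iloc)  # 3 2
```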


In [15]:
# Column types
df.dtypes

ID                 int64
Age                int64
Gender            object
GenderGroup        int64
Glasses           object
GlassesGroup       int64
Height           float64
Wingspan         float64
CWDistance         int64
Complete          object
CompleteGroup      int64
Score              int64
dtype: object

In [16]:
# List unique values in the df['Gender'] column
df.Gender.unique()

array(['F', 'M'], dtype=object)

In [17]:
# Let's explore df["GenderGroup"] as well
df.GenderGroup.unique()

array([1, 2])

In [18]:
# Use .loc[] to specify a list of multiple column names
df.loc[:,["Gender", "GenderGroup"]]

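| | Gender | GenderGroup |
|---|---|---|
| 0 | F | 1 |
| 1 | F | 1 |
| 2 | F | 1 |
| 3 | F | 1 |
| 4 | M | 2 |
| 5 | M | 2 |
| 6 | M | 2 |
| 7 | F | 1 |
| 8 | M | 2 |
| 9 | F | 1 |

Only the first 10 of the 25 rows are shown.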


### Group By

In [20]:
df.groupby(['Gender','GenderGroup']).size()

Gender  GenderGroup
F       1              12
M       2              13
dtype: int64

In [24]:
# numeric_only=True keeps text columns (Glasses, Complete) out of the sum
df.groupby(['Gender','GenderGroup']).sum(numeric_only=True)

| Gender | GenderGroup | ID | Age | GlassesGroup | Height | Wingspan | CWDistance | CompleteGroup | Score |
|---|---|---|---|---|---|---|---|---|---|
| F | 1 | 129 | 358 | 8 | 765.75 | 738.5 | 963 | 10 | 90 |
| M | 2 | 196 | 348 | 6 | 925.5 | 918.0 | 1099 | 9 | 70 |
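
Summing every column is rarely what one wants (the sum of `ID`, for instance, has no meaning). As a hedged sketch of a more targeted summary, the example below groups a toy DataFrame and computes several statistics for a single column with `agg`. The data values are invented for illustration.

```python
import pandas as pd

# Toy data (hypothetical values, not the Cartwheel dataset)
toy = pd.DataFrame(
    {
        "Gender": ["F", "F", "M", "M"],
        "Glasses": ["Y", "N", "N", "Y"],  # text column, excluded from the sum
        "CWDistance": [79, 70, 72, 81],
        "Score": [7, 8, 4, 3],
    }
)

# Restrict the sum to numeric columns
sums = toy.groupby("Gender").sum(numeric_only=True)

# Several statistics for one column at once
stats = toy.groupby("Gender")["CWDistance"].agg(["mean", "min", "max"])

print(sums)
print(stats)
```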


## 2. Pandas DataFrames: NHANES Dataset

[NHANES Dataset](https://wwwn.cdc.gov/nchs/nhanes/Default.aspx): National Health and Nutrition Examination Survey, from the CDC (USA).

In [26]:
import pandas as pd

In [27]:
url = "nhanes_2015_2016.csv"
da = pd.read_csv(url)

In [28]:
da.shape

(5735, 28)

In [29]:
da.columns

Index(['SEQN', 'ALQ101', 'ALQ110', 'ALQ130', 'SMQ020', 'RIAGENDR', 'RIDAGEYR',
       'RIDRETH1', 'DMDCITZN', 'DMDEDUC2', 'DMDMARTL', 'DMDHHSIZ', 'WTINT2YR',
       'SDMVPSU', 'SDMVSTRA', 'INDFMPIR', 'BPXSY1', 'BPXDI1', 'BPXSY2',
       'BPXDI2', 'BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML', 'BMXARMC',
       'BMXWAIST', 'HIQ210'],
      dtype='object')

The codebooks for the 2015-2016 wave of NHANES can be found here:

https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2015

Direct links:

- [Demographics code book](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm)
- [Body measures code book](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BMX_I.htm)
- [Blood pressure code book](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BPX_I.htm)
- [Alcohol questionnaire code book](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/ALQ_I.htm)
- [Smoking questionnaire code book](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/SMQ_I.htm)
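
NHANES stores most answers as numeric codes, and the codebooks give their meaning. One common use of a codebook is to translate codes into readable labels. The sketch below does this for `RIAGENDR` (1 = Male, 2 = Female in the demographics codebook) on a few toy rows. The column name `RIAGENDRx` is a hypothetical name chosen here for the labeled copy.

```python
import pandas as pd

# Toy rows (hypothetical), using the NHANES variable name RIAGENDR
toy = pd.DataFrame({"RIAGENDR": [1, 2, 2, 1]})

# Code-to-label mapping taken from the demographics codebook
gender_labels = {1: "Male", 2: "Female"}

# RIAGENDRx is a hypothetical name for the new, labeled column
toy["RIAGENDRx"] = toy["RIAGENDR"].map(gender_labels)

print(toy["RIAGENDRx"].value_counts())
```

Keeping the original coded column next to the labeled one makes it easy to check the mapping against the codebook later.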

In [30]:
da.dtypes

SEQN          int64
ALQ101      float64
ALQ110      float64
ALQ130      float64
SMQ020        int64
RIAGENDR      int64
RIDAGEYR      int64
RIDRETH1      int64
DMDCITZN    float64
DMDEDUC2    float64
DMDMARTL    float64
DMDHHSIZ      int64
WTINT2YR    float64
SDMVPSU       int64
SDMVSTRA      int64
INDFMPIR    float64
BPXSY1      float64
BPXDI1      float64
BPXSY2      float64
BPXDI2      float64
BMXWT       float64
BMXHT       float64
BMXBMI      float64
BMXLEG      float64
BMXARML     float64
BMXARMC     float64
BMXWAIST    float64
HIQ210      float64
dtype: object

In [33]:
# Different ways of accessing a column
w = da["DMDEDUC2"]
x = da.loc[:, "DMDEDUC2"]
y = da.DMDEDUC2
z = da.iloc[:, 9]  # DMDEDUC2 is at column position 9 (0-based)

In [34]:
print(da["DMDEDUC2"].max())
print(da.loc[:, "DMDEDUC2"].max())
print(da.DMDEDUC2.max())
print(da.iloc[:, 9].max())

9.0
9.0
9.0
9.0
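
The four access styles are equivalent here, but two of them are more fragile. A hard-coded position such as `9` breaks if the column order changes, and attribute access fails for column names that clash with DataFrame methods. The following sketch shows both points on a toy DataFrame; the column named `count` is a hypothetical example added to trigger the clash.

```python
import pandas as pd

# Toy data (hypothetical); "count" clashes with the DataFrame.count method
toy = pd.DataFrame({"SEQN": [1, 2], "DMDEDUC2": [4.0, 9.0], "count": [5, 6]})

# Look the position up instead of hard-coding it
pos = toy.columns.get_loc("DMDEDUC2")
by_pos = toy.iloc[:, pos]

# Bracket, attribute and position access return the same values
same = toy["DMDEDUC2"].equals(by_pos) and toy.DMDEDUC2.equals(by_pos)

# toy.count is the method, not the column; brackets always give the column
is_method = callable(toy.count)
col_values = toy["count"].tolist()

print(pos, same, is_method, col_values)
```

For this reason, bracket access (`da["DMDEDUC2"]`) is usually the safest default.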


In [35]:
print(type(da)) # The type of the variable
print(type(da.DMDEDUC2)) # The type of one column of the data frame
print(type(da.iloc[2,:])) # The type of one row of the data frame

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


In [36]:
# Extract the row at position 3 (the fourth row)
x = da.iloc[3, :]

In [37]:
# Extract row-column ranges
x = da.iloc[3:5, :]
y = da.iloc[:, 2:5]

### Missing Values

In [38]:
print(pd.isnull(da.DMDEDUC2).sum())
print(pd.notnull(da.DMDEDUC2).sum())

261
5474
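
The two counts above add up to 5735, the number of rows reported by `da.shape`, since every value is either missing or present. The sketch below repeats the check on a toy DataFrame with the equivalent method forms `isna` and `notna`, and also counts missing values for all columns at once. The values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with some missing entries
toy = pd.DataFrame(
    {"DMDEDUC2": [4.0, np.nan, 5.0, np.nan], "BMXWT": [70.1, 82.3, np.nan, 65.0]}
)

n_missing = toy["DMDEDUC2"].isna().sum()
n_present = toy["DMDEDUC2"].notna().sum()

# Every row is either missing or present
assert n_missing + n_present == len(toy)

# Missing values per column
missing_per_column = toy.isna().sum()

# Aggregations such as mean skip NaN by default
mean_educ = toy["DMDEDUC2"].mean()

print(missing_per_column)
print(mean_educ)
```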


## 3. Python Guidelines

Cheat sheets:

- [Cheatsheet for NumPy](https://www.datacamp.com/community/blog/python-numpy-cheat-sheet#gs.AK5ZBgE)
- [Cheatsheet for Data Wrangling](https://www.datacamp.com/community/blog/pandas-cheat-sheet-python#gs.HPFoRIc)
- [Cheatsheet for Pandas](https://www.datacamp.com/community/blog/python-pandas-cheat-sheet#gs.oundfxM)
- [Cheatsheet for SciPy](https://www.datacamp.com/community/blog/python-scipy-cheat-sheet#gs.JDSg3OI)
- [Cheatsheet for Matplotlib](https://www.datacamp.com/community/blog/python-matplotlib-cheat-sheet#gs.uEKySpY)

Selected guidelines from the [Google Python Style Guide](https://github.com/google/styleguide/blob/gh-pages/pyguide.md):

- Consistent indenting
- Always comment, and comment well: concise and useful info, do not mix styles
- Avoid excessively long lines (aim for fewer than 80 characters per line)
- Use whitespace to improve code readability
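
As a small illustration of these points, the snippet below uses consistent 4-space indentation, a short docstring and comment, lines under 80 characters, and spaces around operators and after commas. The function name `mean_distance` is a hypothetical example, not something from the course material.

```python
import numpy as np


def mean_distance(distances):
    """Return the arithmetic mean of a sequence of distances."""
    values = np.asarray(distances, dtype=float)
    return float(values.mean())


# First five CWDistance values from the Cartwheel dataset
cartwheel_distances = [79, 70, 85, 87, 72]
average = mean_distance(cartwheel_distances)
print(average)
```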