In [1]:
import pandas as pd
import numpy as np

***
# 2017 NSCH Data

This notebook processes and cleans data from the 2017 National Survey of Children's Health (NSCH).

### Code book:
https://www.census.gov/data-tools/demo/nsch/#/?s_searchvalue=age&s_year=2017&selectedVar=A1_AGE

### Guidelines for Data Use
According to the 2017 NSCH's [Guidelines for Data Use](https://www.census.gov/content/dam/Census/programs-surveys/nsch/tech-documentation/methodology/2017-NSCH-FAQs.pdf), users can use the data for statistical reporting and analysis. Users must not use the data to identify any respondents, inadvertently or otherwise. Finally, prior to releasing any statistics to the public, the Census Bureau conducts reviews to ensure that no information or characteristics can identify any individual.

### Citation
The United States Census Bureau, Associate Director of Demographic Programs, National Survey of Children’s Health. 2017 National Survey of Children’s Health Frequently Asked Questions. September 2018. Available from: https://www.census.gov/content/dam/Census/programs-surveys/nsch/techdocumentation/methodology/NSCH%202016%20FAQs.pdf
***

In [4]:
df = pd.read_stata('nsch_2017_topical.dta')
df.head()

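| | fipsst | stratum | hhid | formtype | totkids_r | hhlanguage | sc_age_years | sc_sex | k2q35a_1_years | momage | ... | a1_grade_if | bmiclass | hhcount_if | fpl_i1 | fpl_i2 | fpl_i3 | fpl_i4 | fpl_i5 | fpl_i6 | fwc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 37 | 1 | 17000010 | T1 | 3 | 3.0 | 0 | 2 | NaN | 36.0 | ... | 0 | NaN | 0 | 50 | 50 | 50 | 50 | 50 | 50 | 16407.556854 |
| 1 | 2 | 2A | 17000013 | T3 | 1 | 1.0 | 13 | 2 | NaN | 30.0 | ... | 0 | 2.0 | 0 | 181 | 181 | 181 | 181 | 181 | 181 | 152.449899 |
| 2 | 40 | 1 | 17000025 | T3 | 1 | 1.0 | 15 | 1 | 13.0 | 30.0 | ... | 0 | 2.0 | 0 | 329 | 329 | 329 | 329 | 329 | 329 | 605.780083 |
| 3 | 13 | 1 | 17000031 | T2 | 1 | 1.0 | 9 | 1 | NaN | 27.0 | ... | 0 | NaN | 0 | 347 | 347 | 347 | 347 | 347 | 347 | 1793.169449 |
| 4 | 31 | 1 | 17000034 | T2 | 2 | 1.0 | 8 | 2 | NaN | 27.0 | ... | 0 | NaN | 0 | 400 | 400 | 400 | 400 | 400 | 400 | 688.982814 |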


## Data Processing

First, I'll create an empty DataFrame that has the same index (one row per respondent) as the original DataFrame.

In [42]:
df_clean = pd.DataFrame(index=df.index)
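
As a minimal sketch of why sharing the index matters, the toy frame below (hypothetical values, not taken from the NSCH file) shows that a column assigned to an index-only DataFrame lines up row by row with its source. The names `toy` and `toy_clean` are illustrative only.

```python
import pandas as pd

# Toy stand-in for the raw data (hypothetical values)
toy = pd.DataFrame({"age": [3, 11, 16]}, index=[10, 20, 30])

# Empty frame: no columns yet, but the same row labels as the source
toy_clean = pd.DataFrame(index=toy.index)
print(toy_clean.shape)  # (3, 0)

# Columns assigned later align on the shared index
toy_clean["age"] = toy["age"]
print(toy_clean)
```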

The first variable I want to save is the child's age (0 to 17 years).

### Age

`sc_age_years`: Age of Selected Child - In Years (0 - 17)

In [43]:
df.sc_age_years.describe()

count    21599.000000
mean         9.428585
std          5.260100
min          0.000000
25%          5.000000
50%         10.000000
75%         14.000000
max         17.000000
Name: sc_age_years, dtype: float64

In [44]:
df_clean['sc_age_years'] = df.sc_age_years.copy()

***

### Screen time

These two variables are the primary ones we will work with. I will subtract 1 from each response code so that a code of 0 means "Spent zero hours", and rename the variables to `cmputr_time` and `tv_time`.

`k7q91_r`: How Much Time Spent with Computers (renamed to `cmputr_time`)

`k7q60_r`: How Much Time Spent Watching TV (renamed to `tv_time`)

**Response Codes:**
- 1 = None
- 2 = Less than 1 hour
- 3 = 1 hour
- 4 = 2 hours
- 5 = 3 hours
- 6 = 4 or more hours

In [45]:
df_clean['cmputr_time'] = df.k7q91_r.copy() - 1
df_clean.cmputr_time.describe()

count    21402.000000
mean         2.279507
std          1.555088
min          0.000000
25%          1.000000
50%          2.000000
75%          3.000000
max          5.000000
Name: cmputr_time, dtype: float64

In [46]:
df_clean['tv_time'] = df.k7q60_r.copy() - 1
df_clean.tv_time.describe()

count    21399.000000
mean         2.314547
std          1.321909
min          0.000000
25%          1.000000
50%          2.000000
75%          3.000000
max          5.000000
Name: tv_time, dtype: float64
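
After the shift, the codes run from 0 to 5. The sketch below shows one way to turn shifted codes back into readable labels when inspecting the data. The label text comes from the response codes listed above; the dictionary name `SCREEN_TIME_LABELS` and the toy series are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Labels for the shifted codes (original code minus 1); the name is hypothetical
SCREEN_TIME_LABELS = {
    0: "None",
    1: "Less than 1 hour",
    2: "1 hour",
    3: "2 hours",
    4: "3 hours",
    5: "4 or more hours",
}

raw = pd.Series([1, 4, 6, np.nan])  # toy original response codes
shifted = raw - 1                   # same shift as applied in this notebook
labels = shifted.map(SCREEN_TIME_LABELS)  # missing responses stay missing
print(labels.tolist())
```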

***
### Anxiety level

These 3 variables are directly related to the **Anxiety Severity Level** of the children. I combine them into a single column that gives the anxiety severity level from Mild (1) to Severe (3), None (0), or data not available (NaN).

`k2q33a`: Has a doctor or other health care provider EVER told you that (fill with SC_NAME) has...Anxiety Problems? **Yes=1**, **No=2** or **NaN**

`k2q33b`: If yes, does (fill with SC_NAME) CURRENTLY have the condition? **Yes=1**, **No=2** or **NaN**

`k2q33c`: Anxiety Severity Level.

**Response Code:**
- 1 = Mild
- 2 = Moderate
- 3 = Severe

**Create a column indicating Anxiety Severity Level:**

**Response Code:**
- NaN = Did not provide answers to any of these questions
- 0 = Did not have anxiety (`k2q33a` is **No** or `k2q33b` is **No**)
- 1 = Mild
- 2 = Moderate
- 3 = Severe
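
The rule above can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the notebook's own cell: the helper name `severity_level` and the toy values are hypothetical, and the inputs are assumed to be numeric Series with NaN for missing answers.

```python
import numpy as np
import pandas as pd

def severity_level(ever, current, severity):
    """Combine the three survey items into one 0-3 level (NaN if unknown)."""
    answered_no = (ever == 2) | (current == 2)
    level = severity.copy()
    # Fill 0 only where severity is missing AND a "No" answer explains why
    level.loc[answered_no & level.isna()] = 0
    return level

toy = pd.DataFrame({
    "ever":     [2, 1, 1, np.nan, 1],
    "current":  [np.nan, 2, 1, np.nan, 1],
    "severity": [np.nan, np.nan, 2, np.nan, np.nan],
})
result = severity_level(toy["ever"], toy["current"], toy["severity"])
print(result.tolist())  # [0.0, 0.0, 2.0, nan, nan]
```

The last toy row answers "Yes" to both questions but gives no severity, so it stays NaN rather than being counted as 0.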

In [47]:
# Start from the severity item; it is NaN when the question was skipped
anxiety_lvl = df['k2q33c'].copy()
# "No" (2) to either the EVER or the CURRENTLY question means no anxiety
no_anxiety = (df['k2q33a'] == 2) | (df['k2q33b'] == 2)
# Set missing severity to 0 only where a "No" answer explains it
anxiety_lvl[no_anxiety & anxiety_lvl.isna()] = 0
df_clean['anxiety_lvl'] = anxiety_lvl
df_clean['anxiety_lvl'].describe()

count    21479.000000
mean         0.132269
std          0.480712
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          3.000000
Name: anxiety_lvl, dtype: float64

***
### Depression level

These 3 variables are directly related to the **Depression Severity Level** of the children. I combine them into a single column that gives the depression severity level from Mild (1) to Severe (3), None (0), or data not available (NaN).

`k2q32a`: Has a doctor or other health care provider EVER told you that (fill with SC_NAME) has...Depression Problems? **Yes=1**, **No=2** or **NaN**

`k2q32b`: If yes, does (fill with SC_NAME) CURRENTLY have the condition? **Yes=1**, **No=2** or **NaN**

`k2q32c`: Depression Severity Level.

**Response Code:**
- 1 = Mild
- 2 = Moderate
- 3 = Severe

**Create a column indicating Depression Severity Level:**

**Response Code:**
- NaN = Did not provide answers to any of these questions
- 0 = Did not have depression (`k2q32a` is **No** or `k2q32b` is **No**)
- 1 = Mild
- 2 = Moderate
- 3 = Severe

In [48]:
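# Start from the severity item; it is NaN when the question was skipped
depression_lvl = df['k2q32c'].copy()
# "No" (2) to either the EVER or the CURRENTLY question means no depression
no_depression = (df['k2q32a'] == 2) | (df['k2q32b'] == 2)
# Set missing severity to 0 only where a "No" answer explains it
depression_lvl[no_depression & depression_lvl.isna()] = 0
df_clean['depression_lvl'] = depression_lvl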
df_clean['depression_lvl'].describe()

count    21520.000000
mean         0.056691
std          0.317323
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          3.000000
Name: depression_lvl, dtype: float64

***
### Need treatment

This variable does not directly relate to anxiety and depression level. However, I'll save it just in case we need it.

`k4q22_r`: DURING THE PAST 12 MONTHS, has (fill with SC_NAME) received any treatment or counseling from a mental health professional...Mental health professionals include psychiatrists, psychologists, psychiatric nurses, and clinical social workers. The variable will be renamed to `need_treatment`.

**Response Code:**
- 1 = Yes
- 2 = No, but this child needed to see a mental health professional
- 3 = No, this child did not need to see a mental health professional

In [49]:
# Missing responses stay NaN; assignment aligns on the index, so dropna() is not needed
df_clean['need_treatment'] = df.k4q22_r.copy()
df_clean.need_treatment.describe()

count    21504.000000
mean         2.773484
std          0.619546
min          1.000000
25%          3.000000
50%          3.000000
75%          3.000000
max          3.000000
Name: need_treatment, dtype: float64
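
A note on missing values: column assignment in pandas aligns on the index, so missing responses remain NaN in `df_clean` whether or not `dropna()` is called before assignment. The toy example below (hypothetical values and column names) is a small sketch of that behavior.

```python
import numpy as np
import pandas as pd

raw = pd.Series([1.0, np.nan, 3.0], index=[0, 1, 2])  # toy responses
out = pd.DataFrame(index=raw.index)

out["with_dropna"] = raw.dropna()  # row 1 is dropped, then realigned as NaN
out["without"] = raw               # row 1 is NaN from the start

print(out["with_dropna"].equals(out["without"]))  # True
```

This is why the `count` in each `describe()` output is lower than the number of rows: rows with missing answers are kept, just not counted.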

***
### Family meals

This is an interesting variable, which will be used in Project 1. Please note that the response code is an ordinal level, not the exact number of days on which the family ate together.

`k8q11`: DURING THE PAST WEEK, on how many days did all the family members who live in the household eat a meal together? (renamed to `family_meals`)

**Response Code:**
- 1 = 0 days
- 2 = 1-3 days
- 3 = 4-6 days
- 4 = Every day

In [50]:
# Shift codes to start at 0; missing responses stay NaN without dropna()
df_clean['family_meals'] = df.k8q11.copy() - 1
df_clean.family_meals.describe()

count    21278.000000
mean         2.065232
std          0.866397
min          0.000000
25%          1.000000
50%          2.000000
75%          3.000000
max          3.000000
Name: family_meals, dtype: float64
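
Because the codes are levels rather than day counts, the mean above is a mean level, not a mean number of days. One way to keep that distinction explicit is an ordered categorical, sketched below on toy data. The list name `MEAL_LEVELS` and the toy codes are hypothetical; the labels come from the response codes above.

```python
import pandas as pd

# Labels for the shifted codes 0-3; the name is hypothetical
MEAL_LEVELS = ["0 days", "1-3 days", "4-6 days", "Every day"]

shifted = pd.Series([0, 2, 3, 1, 3])  # toy shifted codes, no missing values

# from_codes expects integer codes that index into `categories`
meals = pd.Categorical.from_codes(
    shifted.to_numpy(), categories=MEAL_LEVELS, ordered=True
)

# Ordered categories support comparisons and min/max, but not arithmetic
print(meals.max())
print(pd.Series(meals).value_counts(sort=False))
```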

***
### Homework unfinished

This variable is not used in my project. However, I'll save it just in case we need it.

`k7q83_r`: How well do each of the following phrases describe: Your child does all required homework. (Renamed to `homework_unfinished`. The highest value indicates that the child does not finish all required homework.)

**Response Code:**
- 1 = Definitely true
- 2 = Somewhat true
- 3 = Not true

In [51]:
# Missing responses stay NaN; assignment aligns on the index, so dropna() is not needed
df_clean['homework_unfinished'] = df.k7q83_r.copy()
df_clean.homework_unfinished.describe()

count    15238.000000
mean         1.305027
std          0.550181
min          1.000000
25%          1.000000
50%          1.000000
75%          2.000000
max          3.000000
Name: homework_unfinished, dtype: float64

***
## Save the DataFrame

In [53]:
df_clean.head()

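| | sc_age_years | cmputr_time | tv_time | anxiety_lvl | depression_lvl | need_treatment | family_meals | homework_unfinished |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 1.0 | NaN |
| 1 | 13 | 3.0 | 2.0 | 0.0 | 0.0 | 3.0 | 2.0 | 1.0 |
| 2 | 15 | 3.0 | 1.0 | 0.0 | 0.0 | 3.0 | 2.0 | 2.0 |
| 3 | 9 | 4.0 | 4.0 | 0.0 | 0.0 | 3.0 | 1.0 | 1.0 |
| 4 | 8 | 1.0 | 2.0 | 0.0 | 0.0 | 3.0 | 2.0 | 1.0 |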


In [54]:
df_clean.to_pickle("./nsch_2017_topical_clean.pkl")
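
To check that a pickled frame reloads unchanged, a round trip like the one below can help. This is a sketch on toy data written to a temporary directory; the file name `toy_clean.pkl` and the toy values are hypothetical.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"tv_time": [0.0, 2.0, None]})  # toy cleaned data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_clean.pkl")  # hypothetical file name
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() treats NaN values in the same positions as equal
print(restored.equals(toy))  # True
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.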