In [1]:
import numpy as np
import pandas as pd

Download the topical data file, in SAS format, from:

https://www2.census.gov/programs-surveys/nsch/datasets/2021/nsch_2021_topical_SAS.zip

Turn it into a data frame. We're only interested in the following columns:

FIPSST

VEGETABLES

FRUIT

SUGARDRINK

Turn the FIPSST column into an integer, and make it the index.

In [2]:
filename = "C:/Users/Pawan Kataria/Downloads/nsch_2021_topical_SAS/nsch_2021e_topical.sas7bdat"

df = pd.read_sas(filename)
df.head()

,FIPSST,STRATUM,HHID,FORMTYPE,TOTKIDS_R,TENURE,HHLANGUAGE,SC_AGE_YEARS,SC_SEX,K2Q35A_1_YEARS,...,HIGRADE_TVIS,FPL_I1,FPL_I2,FPL_I3,FPL_I4,FPL_I5,FPL_I6,FWC,HEIGHT,WEIGHT
0,b'48',b'1',b'21000002',b'T1',2.0,1.0,3.0,2.0,1.0,,...,4.0,400.0,400.0,400.0,400.0,400.0,400.0,2311.502888,,
1,b'02',b'1',b'21000009',b'T3',1.0,2.0,1.0,12.0,1.0,,...,2.0,139.0,139.0,139.0,139.0,139.0,139.0,72.154871,147.32,40.82
2,b'40',b'1',b'21000017',b'T2',2.0,1.0,1.0,8.0,1.0,,...,4.0,395.0,395.0,395.0,395.0,395.0,395.0,2286.078547,132.08,27.21
3,b'26',b'1',b'21000030',b'T1',1.0,1.0,1.0,2.0,2.0,,...,2.0,336.0,336.0,336.0,336.0,336.0,336.0,834.600631,,
4,b'22',b'1',b'21000031',b'T1',3.0,3.0,1.0,5.0,2.0,,...,2.0,96.0,96.0,96.0,96.0,96.0,96.0,768.121102,,


In [3]:
# Take an explicit copy so the column assignment below modifies this frame, not a view
df = df[['FIPSST', 'VEGETABLES', 'FRUIT', 'SUGARDRINK']].copy()
df.head()

,FIPSST,VEGETABLES,FRUIT,SUGARDRINK
0,b'48',3.0,2.0,1.0
1,b'02',,,
2,b'40',,,
3,b'26',2.0,2.0,2.0
4,b'22',2.0,3.0,2.0


In [4]:
df['FIPSST'] = df['FIPSST'].astype(np.int8)
df = df.set_index('FIPSST')

In [5]:
df.head()

FIPSST,VEGETABLES,FRUIT,SUGARDRINK
48,3.0,2.0,1.0
2,,,
40,,,
26,2.0,2.0,2.0
22,2.0,3.0,2.0
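The loading steps above can be condensed into one helper. This is a minimal sketch, not the notebook's own method: `prepare` and the tiny `raw` frame are hypothetical stand-ins for the output of `pd.read_sas`, which returns FIPSST as bytes values such as `b'48'`. Decoding the bytes before converting makes the intent explicit.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_sas(filename): FIPSST arrives as bytes.
raw = pd.DataFrame({
    'FIPSST': [b'48', b'02', b'40'],
    'HHID': [b'21000002', b'21000009', b'21000017'],
    'VEGETABLES': [3.0, np.nan, np.nan],
    'FRUIT': [2.0, np.nan, np.nan],
    'SUGARDRINK': [1.0, np.nan, np.nan],
})

def prepare(frame):
    """Keep the four columns of interest and index by integer FIPSST."""
    cols = ['FIPSST', 'VEGETABLES', 'FRUIT', 'SUGARDRINK']
    out = frame[cols].copy()  # copy so the assignment below is safe
    # b'02' -> '02' -> 2; state FIPS codes are small enough for int8
    decoded = out['FIPSST'].str.decode('ascii')
    out['FIPSST'] = pd.to_numeric(decoded).astype(np.int8)
    return out.set_index('FIPSST')

small = prepare(raw)
```

With the real file, the call would be `prepare(pd.read_sas(filename))`.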


What percentage of children had, on average, less than one vegetable per day during the week preceding the study?

To interpret these columns, we need the survey's data dictionary, or “codebook.” This web application explains every column in the study. You can access the codebook here: https://www.census.gov/data-tools/demo/uccb/nschdict.

In addition to FIPSST, which serves as a numeric identifier for a state or territory, the data set includes three columns of interest:

- **VEGETABLES:** A categorical column indicating the frequency of vegetable consumption by the child in the past week:
  1. This child did not eat vegetables
  2. 1-3 times during the past week
  3. 4-6 times during the past week
  4. 1 time per day
  5. 2 times per day
  6. 3 or more times per day

- **FRUIT:** Similar to VEGETABLES, this categorical column describes how often the child ate fruit in the past week:
  1. This child did not eat fruit
  2. 1-3 times during the past week
  3. 4-6 times during the past week
  4. 1 time per day
  5. 2 times per day
  6. 3 or more times per day

- **SUGARDRINK:** This column details the frequency of sugary drink consumption by the child in the past week:
  1. This child did not drink sugary drinks
  2. 1-3 times during the past week
  3. 4-6 times during the past week
  4. 1 time per day
  5. 2 times per day
  6. 3 or more times per day
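
Because codes 1 through 3 all mean "less than once per day," the comparison `< 4` is what answers the questions below. The following sketch makes that mapping explicit; `FREQUENCY_LABELS` and the sample `codes` are illustrative names and values, and the label for code 1 paraphrases the three column-specific wordings listed above.

```python
import pandas as pd

# Illustrative mapping shared by VEGETABLES, FRUIT, and SUGARDRINK
FREQUENCY_LABELS = {
    1: 'not at all',
    2: '1-3 times during the past week',
    3: '4-6 times during the past week',
    4: '1 time per day',
    5: '2 times per day',
    6: '3 or more times per day',
}

codes = pd.Series([3.0, 2.0, 5.0, None])  # None becomes NaN (no answer)

labels = codes.map(FREQUENCY_LABELS)      # NaN stays NaN
less_than_daily = codes < 4               # NaN compares as False
```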

In [11]:
df['VEGETABLES'] < 4

FIPSST
48     True
2     False
40    False
26     True
22     True
      ...  
55    False
36    False
30    False
55    False
28     True
Name: VEGETABLES, Length: 50892, dtype: bool

In [10]:
df.loc[df['VEGETABLES'] < 4 , 'VEGETABLES']

FIPSST
48    3.0
26    2.0
22    2.0
2     2.0
16    3.0
     ... 
35    3.0
16    3.0
38    2.0
44    3.0
28    2.0
Name: VEGETABLES, Length: 8669, dtype: float64

As always when using .loc, the first argument specifies the rows, and the second argument specifies the columns. In this case, we're selecting the VEGETABLES column in df, but only for rows where the value of VEGETABLES is less than 4.
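
A small sketch of the same row/column selection on made-up data (the `toy` frame is hypothetical). Passing a single column name returns a Series, while passing a list of names returns a data frame.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {'VEGETABLES': [3.0, np.nan, 5.0, 2.0],
     'FRUIT': [2.0, np.nan, 4.0, 6.0]},
    index=pd.Index([48, 2, 40, 26], name='FIPSST'),
)

# Row selector: a Boolean Series, True where VEGETABLES is below 4
row_mask = toy['VEGETABLES'] < 4

# Column selector as a string -> Series
one_column = toy.loc[row_mask, 'VEGETABLES']

# Column selector as a list -> DataFrame
two_columns = toy.loc[row_mask, ['VEGETABLES', 'FRUIT']]
```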

In [12]:
df.loc[df['VEGETABLES'] < 4, 'VEGETABLES'].count()

8669

In [13]:
df.loc[df['VEGETABLES'] < 4 , 'VEGETABLES'].count() / df['VEGETABLES'].count()

0.47014480177883833

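I obtained a result of 0.47, indicating that 47 percent, nearly half, of the surveyed children aged 5 and under ate vegetables less than once per day. Rows with no answer (NaN) are excluded from both the numerator and the denominator, because count() ignores missing values.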
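
The same proportion can also be computed as the mean of a Boolean mask, provided the missing answers are dropped first. This is an alternative sketch on made-up values, not the approach used in the cells above.

```python
import numpy as np
import pandas as pd

# Made-up answers; NaN means the question was not answered
veg = pd.Series([3.0, np.nan, np.nan, 2.0, 2.0, 5.0, 4.0, 6.0])

# Approach used above: count matching rows, divide by non-missing rows
share_by_count = veg[veg < 4].count() / veg.count()

# Alternative: drop NaN first, then take the mean of True/False values
answered = veg.dropna()
share_by_mean = (answered < 4).mean()
```

Dropping NaN first matters: taking the mean of `veg < 4` directly would count unanswered rows as False and understate the share.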

What percentage of children had, on average, less than one vegetable per day and less than one fruit per day during the week preceding the study?

In [16]:
(df['VEGETABLES'] < 4) & (df['FRUIT'] < 4)

FIPSST
48     True
2     False
40    False
26     True
22     True
      ...  
55    False
36    False
30    False
55    False
28    False
Length: 50892, dtype: bool

In [20]:
df.loc[(df['VEGETABLES'] < 4) & (df['FRUIT'] < 4)]

FIPSST,VEGETABLES,FRUIT,SUGARDRINK
48,3.0,2.0,1.0
26,2.0,2.0,2.0
22,2.0,3.0,2.0
2,2.0,3.0,2.0
16,3.0,2.0,2.0
...,...,...,...
9,2.0,2.0,1.0
9,3.0,2.0,2.0
34,1.0,2.0,2.0
17,2.0,2.0,2.0


In [25]:
df.loc[(df['VEGETABLES'] < 4) & (df['FRUIT'] < 4), 
       ['VEGETABLES', 'FRUIT']]

FIPSST,VEGETABLES,FRUIT
48,3.0,2.0
26,2.0,2.0
22,2.0,3.0
2,2.0,3.0
16,3.0,2.0
...,...,...
9,2.0,2.0
9,3.0,2.0
34,1.0,2.0
17,2.0,2.0


In [26]:
df.loc[(df['VEGETABLES'] < 4) & (df['FRUIT'] < 4), ['VEGETABLES', 'FRUIT']].count() / df['VEGETABLES'].count()

VEGETABLES    0.26265
FRUIT         0.26265
dtype: float64

I got approximately 0.26, indicating that about one quarter of the surveyed young children ate vegetables less than once a day and also ate fruit less than once a day.
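
The parentheses around each comparison in the cells above are required, because `&` binds more tightly than `<` in Python. A minimal sketch on made-up Series shows both the working form and the failure without parentheses.

```python
import numpy as np
import pandas as pd

veg = pd.Series([3.0, 2.0, 5.0, np.nan, 1.0])
fruit = pd.Series([2.0, 5.0, 2.0, 2.0, np.nan])

# Correct: each comparison is evaluated first, then combined element-wise
both_low = (veg < 4) & (fruit < 4)

# Without parentheses Python reads this as veg < (4 & fruit) < 4,
# which raises an exception instead of combining the two conditions.
raised = False
try:
    veg < 4 & fruit < 4
except (TypeError, ValueError):
    raised = True
```

A row with NaN in either column is False in `both_low`, since comparisons with NaN are always False.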

What percentage of children had, on average, less than one vegetable per day and less than one fruit per day and did have a sugary drink during the week preceding the study?

In [27]:
df.loc[(df['VEGETABLES'] < 4) & 
       (df['FRUIT'] < 4) &
       (df['SUGARDRINK'] > 1), 
       'VEGETABLES'].count() / df['VEGETABLES'].count()

0.1590107923423179

I got a result of 0.159, which means that about 16 percent of the surveyed young children ate both fruit and vegetables less than once a day but had at least one sugary drink during the week.
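
With three conditions, naming each mask can make the query easier to read and check. The sketch below uses a made-up `kids` frame and hypothetical mask names; the denominator follows the cell above (children with a VEGETABLES answer).

```python
import numpy as np
import pandas as pd

kids = pd.DataFrame({
    'VEGETABLES': [3.0, 2.0, 5.0, np.nan, 1.0, 2.0],
    'FRUIT':      [2.0, 3.0, 2.0, np.nan, 6.0, 1.0],
    'SUGARDRINK': [1.0, 2.0, 3.0, np.nan, 2.0, 4.0],
})

low_veg = kids['VEGETABLES'] < 4     # vegetables less than once a day
low_fruit = kids['FRUIT'] < 4        # fruit less than once a day
any_sugar = kids['SUGARDRINK'] > 1   # at least one sugary drink

answered = kids['VEGETABLES'].count()            # non-missing rows
matching = (low_veg & low_fruit & any_sugar).sum()
share = matching / answered
```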

Download the FIPS state reference info, in CSV format, from https://www2.census.gov/geo/docs/reference/state.txt. Turn this into a data frame, with the STATE column as the index. The only other column we care about is STATE_NAME.

In [28]:
fips_url = 'https://www2.census.gov/geo/docs/reference/state.txt'

fips_df = pd.read_csv(fips_url,
                      sep='|',
                      usecols=['STATE', 'STATE_NAME'],
                      index_col='STATE')

In [30]:
new_df = df.join(fips_df)
new_df

,VEGETABLES,FRUIT,SUGARDRINK,STATE_NAME
1,5.0,5.0,1.0,Alabama
1,,,,Alabama
1,,,,Alabama
1,3.0,3.0,1.0,Alabama
1,,,,Alabama
...,...,...,...,...
56,,,,Wyoming
56,5.0,6.0,2.0,Wyoming
56,,,,Wyoming
56,,,,Wyoming


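The resulting data frame has four columns: VEGETABLES, FRUIT, and SUGARDRINK from df, and STATE_NAME from fips_df. The join matches rows on the index, so each survey row picks up the state name whose STATE code equals its FIPSST value.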
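
A minimal sketch of how `join` lines up the two indexes, using made-up survey rows (`answers`) and a two-row lookup table (`states`). FIPS code 2 is Alaska and 48 is Texas.

```python
import pandas as pd

# Survey side: the index repeats, one row per child
answers = pd.DataFrame(
    {'VEGETABLES': [3.0, 5.0, 2.0]},
    index=pd.Index([48, 2, 48], name='FIPSST'),
)

# Lookup side: one row per state
states = pd.DataFrame(
    {'STATE_NAME': ['Alaska', 'Texas']},
    index=pd.Index([2, 48], name='STATE'),
)

# Left join on the index: every survey row is kept and labeled
joined = answers.join(states)
```

Because the join is a left join by default, a survey row whose code had no match in the lookup table would be kept with NaN as its STATE_NAME.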

What percentage of children, per state, had, on average, less than one vegetable per day during the week preceding the study?

In [31]:
new_df.loc[new_df['VEGETABLES']<4, 'VEGETABLES']

1     3.0
1     1.0
1     3.0
1     2.0
1     3.0
     ... 
56    2.0
56    3.0
56    2.0
56    2.0
56    2.0
Name: VEGETABLES, Length: 8669, dtype: float64

In [32]:
new_df.loc[new_df['VEGETABLES'] < 4,          
    ['STATE_NAME','VEGETABLES']]


,STATE_NAME,VEGETABLES
1,Alabama,3.0
1,Alabama,1.0
1,Alabama,3.0
1,Alabama,2.0
1,Alabama,3.0
...,...,...
56,Wyoming,2.0
56,Wyoming,3.0
56,Wyoming,2.0
56,Wyoming,2.0


In [34]:
new_df.loc[ new_df['VEGETABLES'] < 4,          
    ['STATE_NAME','VEGETABLES']].groupby('STATE_NAME').count()

STATE_NAME,VEGETABLES
Alabama,196
Alaska,148
Arizona,159
Arkansas,189
California,145
Colorado,215
Connecticut,168
Delaware,184
District of Columbia,137
Florida,178


In [37]:
new_df.loc[ new_df['VEGETABLES'] < 4,          
    ['STATE_NAME','VEGETABLES']].groupby('STATE_NAME').count()/ new_df.groupby('STATE_NAME')['VEGETABLES'].count()

STATE_NAME,Alabama,Alaska,Arizona,Arkansas,California,Colorado,Connecticut,Delaware,District of Columbia,Florida,...,Tennessee,Texas,Utah,VEGETABLES,Vermont,Virginia,Washington,West Virginia,Wisconsin,Wyoming
Alabama,,,,,,,,,,,...,,,,,,,,,,
Alaska,,,,,,,,,,,...,,,,,,,,,,
Arizona,,,,,,,,,,,...,,,,,,,,,,
Arkansas,,,,,,,,,,,...,,,,,,,,,,
California,,,,,,,,,,,...,,,,,,,,,,
Colorado,,,,,,,,,,,...,,,,,,,,,,
Connecticut,,,,,,,,,,,...,,,,,,,,,,
Delaware,,,,,,,,,,,...,,,,,,,,,,
District of Columbia,,,,,,,,,,,...,,,,,,,,,,
Florida,,,,,,,,,,,...,,,,,,,,,,
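
The grid of NaN values above comes from dividing a data frame by a Series: pandas aligns the Series index (state names) with the data frame's columns (just VEGETABLES), finds no labels in common, and fills everything with NaN. The sketch below reproduces this on made-up counts and shows that selecting the VEGETABLES column first gives a Series-by-Series division aligned on state name.

```python
import pandas as pd

# Made-up per-state counts: a one-column DataFrame and a Series of totals
low_counts = pd.DataFrame({'VEGETABLES': [2, 1]},
                          index=pd.Index(['Alabama', 'Alaska'],
                                         name='STATE_NAME'))
totals = pd.Series([4, 2],
                   index=pd.Index(['Alabama', 'Alaska'], name='STATE_NAME'))

# DataFrame / Series: Series index is matched against DataFrame columns
misaligned = low_counts / totals

# Series / Series: both sides are matched on the state-name index
aligned = low_counts['VEGETABLES'] / totals
```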


In [38]:
(new_df.loc[new_df['VEGETABLES'] < 4, ['STATE_NAME', 'VEGETABLES']]
       .groupby('STATE_NAME')['VEGETABLES'].count()
 / new_df.groupby('STATE_NAME')['VEGETABLES'].count()
 * 100).sort_values()

STATE_NAME
Vermont                 30.000000
Maine                   33.501259
District of Columbia    35.218509
Minnesota               38.235294
New Hampshire           38.977636
Montana                 39.265537
Kansas                  40.860215
Oregon                  41.118421
Tennessee               41.964286
Alaska                  42.045455
California              42.151163
Colorado                42.574257
Ohio                    42.660550
Washington              42.896936
Massachusetts           42.948718
Wisconsin               43.283582
Maryland                43.971631
North Dakota            44.342508
Wyoming                 44.943820
Connecticut             45.405405
Iowa                    46.089385
North Carolina          46.710526
New Mexico              46.905537
South Carolina          47.040498
Michigan                47.222222
Missouri                47.321429
Pennsylvania            47.826087
Virginia                47.854785
West Virginia           48.245614
...

The states with the smallest share of children eating vegetables less than once a day, that is, where children eat vegetables most often, are Vermont, Maine, and the District of Columbia. The largest shares are in Alabama, Mississippi, and Louisiana.
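
The per-state percentage can also be written as a single grouped mean, which avoids dividing two separately grouped counts. This is an alternative sketch on made-up rows, not the notebook's own computation; the `low` column is a hypothetical helper.

```python
import numpy as np
import pandas as pd

survey = pd.DataFrame({
    'STATE_NAME': ['Vermont'] * 5 + ['Alabama'] * 4,
    'VEGETABLES': [5.0, 4.0, 2.0, 6.0, np.nan, 2.0, 3.0, 5.0, 1.0],
})

# Drop unanswered rows so they do not count as "not low"
answered = survey.dropna(subset=['VEGETABLES'])

# Mean of a True/False column per state = share of True values
pct_low = (answered
           .assign(low=answered['VEGETABLES'] < 4)
           .groupby('STATE_NAME')['low']
           .mean()
           * 100).sort_values()
```

On these rows, Vermont has one low answer out of four and Alabama has three out of four, so Vermont sorts first.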