In [3]:
import csv
import pandas as pd
import numpy as np
import scipy as sp
from itables import init_notebook_mode

In [4]:
%cd "C:\\Users\\Elena.Mariani\\Documents\\Data\\National Diet Nutrition Survey\\UKDA-6533-tab\\tab"

C:\Users\Elena.Mariani\Documents\Data\National Diet Nutrition Survey\UKDA-6533-tab\tab


In [5]:
init_notebook_mode(all_interactive=True)


# National Diet and Nutrition Survey 

The National Diet and Nutrition Survey Rolling Programme (NDNS RP) is a cross-sectional survey with a continuous programme of fieldwork, designed to assess the diet, nutrient intake and nutritional status of the general population aged 1.5 years and over living in private households in the UK.

The NDNS provides the only source of nationally representative UK data on the types and quantities of foods consumed by individuals, from which estimates of nutrient intake for the population are derived, and on their nutritional status, from analysis of blood and urinary biomarkers. Results are used by government to monitor progress toward the diet and nutrition objectives of UK Health Departments and to develop policy interventions.



## Sample Design

The survey aimed to collect data from a UK representative sample of 1000 people per fieldwork year: 500 adults (aged 19+ years) and 500 children (aged 1.5-18 years). The sample was drawn from the ‘small users’ sub-file of the Postcode Address File (PAF), a computer list, prepared by the Post Office, of all the addresses (delivery points) which receive fewer than 25 articles of mail a day. To improve cost effectiveness, the addresses were clustered into Primary Sampling Units (PSUs): small geographical areas, based on postcode sectors, randomly selected from across the UK. In each PSU, 28 addresses were randomly selected. At each address, the interviewer established the number of households and, in cases where there were two or more, selected one household at random.

## Survey Structure

**Stage 1: Interviewer Visit**
- Four-day food diary
- Face-to-face Computer Assisted Personal interview (CAPI)
- Height and weight measurements
- Smoking and drinking self-completion questionnaires (aged 8-17 years)
- Physical activity self-completion questionnaire (aged 16+ years)
- Collection of spot urine sample (aged 4+ years)

**Stage 2: Nurse Visit**
- Fasting blood sample (aged 4+ years)
- Non-fasting blood sample (aged 1.5-3 years) 
- Physical measurements: waist and hip (aged 11+ years), demispan (all aged 65+ years and those aged 16-64 years from whom a reliable height measurement could not be taken) and infant length (under 2 years)
- Collection of information about prescribed medicines

## Variables in the Datasets

The individual and household datasets contain questionnaire variables (excluding variables used for administrative purposes), demographic information including household composition, laboratory results and derived variables. 

The dietary datasets contain variables coded from the diaries at food, day and person levels, plus dietary reference values and derived variables. 

## Merging Datasets

As various data are contained in different datasets, users may need to merge several datasets together for the purposes of their analysis. Individual serial number, survey year, age, sex and country variables are included in all the datasets for consistency.

Composition of the serial numbers:
- SERIALH: Household serial number. The same number is allocated to each member of the same household. The first number corresponds to the survey year. It is included in the household and individual files.
- SERIALP: Individual identifier for each household member in a productive household (SERIALH + PGRID). It is included in the household file only.
- SERIALI: Individual serial number for each productive individual (SERIALH + ADCHILD). It is included in the household and individual files and in all dietary files.

The individual file also contains the person number of the Household Reference Person (HRP) and the Main Food Provider (MFP) (variables HRPNO and MFPNUM respectively). To create individual serial numbers for either the HRP or MFP, add HRPNO or MFPNUM to SERIALH. Note that the HRP and MFP numbers correspond to the person number within each household, whereas each productive individual is recoded to 1 for the adult and 2 for the child. A serial number built from HRPNO or MFPNUM may therefore coincide with the serial number of a different individual, and the same individual may appear under different serial numbers.
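As a rough illustration of the merging described above, the sketch below joins two toy tables on `seriali`. The data values are invented, and the choice of a left join with `validate="one_to_one"` is an assumption that suits files holding one row per individual (such as the individual and person level dietary files); food or day level files would need a one-to-many join instead.

```python
import pandas as pd

# Toy stand-ins for the individual and person level dietary files.
# All values are invented for illustration only.
ind_demo = pd.DataFrame({
    "seriali": [90101011, 90101012, 90102011],
    "serialh": [901010, 901010, 901020],
    "wti_Y911": [0.8, 1.3, 1.1],
})
pdd_demo = pd.DataFrame({
    "seriali": [90101011, 90101012, 90102011],
    "Energykcal": [1850.0, 1420.0, 2100.0],
})

# Both tables hold one row per individual, so check that the key is unique
# on each side and flag rows that fail to match.
merged = ind_demo.merge(
    pdd_demo, on="seriali", how="left",
    validate="one_to_one", indicator=True,
)
n_unmatched = int((merged["_merge"] != "both").sum())
print(merged)
print("Unmatched individuals:", n_unmatched)
```

The `indicator=True` column makes it easy to spot individuals missing from one of the files before any weighted analysis is run.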

## Using the Dietary Data

### Dietary Coding

- All individual ingredients of a homemade recipe as reported in the food diary, or components of the purchased product as described on the food packaging, have been coded as their separate food codes and linked together under the appropriate Recipe Food Group, which highlights that those food codes were consumed together in one composite dish. The following variables should be used when calculating food consumption data:
  - RECIPEMAINFOODGROUPCODE
  - RECIPEMAINFOODGROUPDESC
  - RECIPESUBFOODGROUPCODE
  - RECIPESUBFOODGROUPDESC
- An example: a homemade dish of Thai chicken curry containing chicken, Thai curry sauce and onion would appear in the Food Level dietary dataset as three entries with the food names CHICKEN BOILED LIGHT MEAT ONLY, THAI CURRY SAUCE PURCHASED and ONIONS BOILED, linked to the MAINFOODGROUPDESC values “chicken and turkey dishes”, “miscellaneous” and “vegetables not raw”, respectively. As these three foods were consumed together in one composite dish, they are assigned to the RECIPESUBFOODGROUPDESC “Other chicken/turkey including homemade recipe dishes”.
- Recipe names are included in the Years 9-11 dataset.
- To estimate absolute food consumption of one specific food type examine the FOODNAME and MAINFOODGROUPDESC variables, whilst examining disaggregation variables of any foods that are composites (NB disaggregation data is only provided for certain categories of meat, fish, fruit and vegetables). For example, to estimate absolute intakes of sausages from all sources you would need to include all the specific discrete portions of sausages, as well as calculate the percentage of sausages within all composite foods such as meat pies.
- All foods consumed have a base unit of grams; that is, the amount consumed is described in grams. The exceptions are dietary supplements and artificial sweeteners, which have a base unit based on their form (e.g. tablet, teaspoon). To avoid errors when calculating consumption, these have only been included in the food level dietary data file. When using this file, note that for dietary supplements and artificial sweeteners the value in the Total_Grams column is not in grams but in the base unit, e.g. 0.5 for a granulated artificial sweetener means 0.5 of a teaspoon, not 0.5 grams.
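The points above can be sketched on toy data: sum the grams consumed within each recipe sub food group per participant, leaving out rows whose base unit is not grams. The `in_grams` flag is a hypothetical helper column (the real file would need its own rule for identifying supplements and sweeteners), the column spellings are assumed to match the variable names quoted above, the group label for the standalone onion row is only a placeholder, and all values are invented.

```python
import pandas as pd

curry = "Other chicken/turkey including homemade recipe dishes"
food_demo = pd.DataFrame({
    "seriali": [1, 1, 1, 1, 2],
    "FoodName": [
        "CHICKEN BOILED LIGHT MEAT ONLY",
        "THAI CURRY SAUCE PURCHASED",
        "ONIONS BOILED",
        "ARTIFICIAL SWEETENER GRANULATED",  # invented food name
        "ONIONS BOILED",
    ],
    # Last two labels are placeholders, not taken from the NDNS codebook.
    "RECIPESUBFOODGROUPDESC": [curry, curry, curry,
                               "Artificial sweeteners", "Vegetables not raw"],
    "TotalGrams": [120.0, 80.0, 40.0, 0.5, 60.0],  # 0.5 is teaspoons, not grams
    "DiaryDaysCompleted": [4, 4, 4, 4, 3],
    "in_grams": [True, True, True, False, True],   # hypothetical flag
})

# Keep only rows measured in grams before summing consumption.
gram_rows = food_demo[food_demo["in_grams"]]

totals = gram_rows.groupby(
    ["seriali", "RECIPESUBFOODGROUPDESC"], as_index=False
).agg(total_g=("TotalGrams", "sum"), days=("DiaryDaysCompleted", "first"))

# Average daily consumption over the diary days each participant completed.
totals["g_per_day"] = totals["total_g"] / totals["days"]
print(totals)
```

Grouping on the recipe variables keeps the three curry ingredients together as one composite dish, whereas grouping on MAINFOODGROUPDESC would split them across three food groups.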

## Weighting Variables

### Description of Weights

The NDNS requires weights to correct for differences in sample selection and response. The weights adjust for differential selection probabilities of households and individuals, non-response to the individual and RPAQ questionnaires, non-response to the nurse visit and non-response to the blood sample. Non-response weights were generated using a mixture of non-response modelling and calibration weighting methods.

| Weight  | Description | Use for |
| :----------- | :----------- | :----------- |
| wti_Y911 | Weight for non-response by individuals to the individual questionnaire and diary | Any analysis of individuals using data from the individual questionnaire or diary |
| wtn_Y911    | Weight for non-response by individuals to the nurse visit  | Any analysis of individuals using data collected at the nurse visit |
|wtb_Y911 | Weight for non-response by individuals to the blood sample | Any analysis of individuals using the blood sample data |
| wtr_Y911 | Weight for analysis of RPAQ  |Any analysis of RPAQ info for individuals aged 16+ |
| wtsu_Y911 | Weight for analysis of spot urinary iodine data | Any analysis of individuals with spot urinary iodine data |

There is a single weight for all individuals, rather than separate weights for adults and children. The sample therefore needs to be filtered by age to ensure the correct ages are included. It also means that age breaks different from those in the published tables can be used, e.g. those aged 16 to 18 years can be combined with adults (19 years and over), which allows more flexibility in reporting.

**wti_Y911**: adjusts for dwelling unit, catering unit and individual selection, and for the age/sex and regional profiles of participating individuals. Use this weight for any analysis of interview and food data.
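A minimal sketch of how the individual weight might be applied: filter to the age range of interest, then compute a weighted mean with `numpy.average`. The toy values are invented, and the example assumes the age and weight columns have already been brought into one table. It gives a point estimate only; standard errors would additionally need the strata and PSU information.

```python
import numpy as np
import pandas as pd

# Invented values for illustration only.
demo = pd.DataFrame({
    "AgeR": [5, 17, 30, 70],
    "wti_Y911": [0.5, 1.0, 1.5, 1.0],
    "Energykcal": [1200.0, 2000.0, 2200.0, 1800.0],
})

# A single weight covers adults and children, so select ages explicitly.
# Here 16-18 year olds are combined with adults.
aged_16_plus = demo[demo["AgeR"] >= 16]

weighted_mean = np.average(
    aged_16_plus["Energykcal"], weights=aged_16_plus["wti_Y911"]
)
unweighted_mean = aged_16_plus["Energykcal"].mean()
print(weighted_mean, unweighted_mean)
```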


### Strata

There are five strata variables (astrata1 to astrata5); use the one most appropriate for your analysis. To find it, try them in sequence, starting from astrata1: if astrata1 is "not working" (explained below), try astrata2, then astrata3, and so on up to astrata5.

"Not working" means that there are more than a handful of single-PSU strata to deal with. For example, if you are analysing the whole sample, astrata1 will be fine. If you are analysing a subgroup (e.g. those aged 65+), astrata1 will probably produce a number of single-PSU strata. If there are only a handful, you can recode them appropriately (group them with adjacent strata within region); if there are more than a few, try astrata2 instead. If there are still too many single-PSU strata to recode, try astrata3, and so on.
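One way to check whether a strata variable is workable for a subgroup is to count the distinct PSUs in each stratum after filtering. The sketch below does this on invented data; it assumes the `Area` column identifies the PSU, which should be confirmed against the NDNS documentation, and `single_psu_strata` is a hypothetical helper, not part of any library.

```python
import pandas as pd

# Invented data: six individuals across three strata and five PSUs.
demo = pd.DataFrame({
    "astrata1": [1, 1, 1, 2, 2, 3],
    "Area": [101, 101, 102, 201, 201, 301],  # assumed to be the PSU identifier
    "AgeR": [70, 30, 68, 66, 80, 72],
})

def single_psu_strata(df, strata_col, psu_col="Area"):
    """Return the strata that contain exactly one distinct PSU."""
    psu_counts = df.groupby(strata_col)[psu_col].nunique()
    return psu_counts[psu_counts == 1].index.tolist()

# Restrict to the subgroup first, because dropping cases can leave
# a stratum with only one PSU.
older = demo[demo["AgeR"] >= 65]
lonely = single_psu_strata(older, "astrata1")
print(lonely)
```

Running the same helper with `"astrata2"`, `"astrata3"` and so on would show which variable leaves few enough single-PSU strata to recode by hand.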

### Combining data from previous NDNS years (Years 1-8)

The NDNS datasets for Years 1-4, Years 5&6, Years 7&8 and Years 9-11 can be combined for analysis of Years 1-11 but, to produce valid results, the four sets of weights should be re-scaled. This will ensure that the four sets of data are in their correct proportions. A different calculation is required for each weight (individual, nurse, blood etc.).

To re-scale the weights correctly, it is necessary to perform the following calculations:
1. Divide each weight variable by its sum (i.e. the sum of the weights);
2. Multiply each by the combined sum of the four weights (15,655);
3. Multiply the Years 1-4 weight by 4/11, Years 5&6 weight by 2/11, Years 7&8 weight by 2/11 and Years 9-11 weight by 3/11.
4. The resulting weights can then be combined into one variable.

The above explanation assumes that analysis will be performed for all cases i.e. all adults and children. If analysis of subgroups is required, analogous calculations should be performed on the combined dataset filtered to include only the subgroup of interest. This will produce bespoke weights for analysis of that particular subgroup (adults only for example).

For subgroup analysis, one additional step (step 5) is required; otherwise the procedure is the same:
1. Divide each weight variable by its sum (i.e. the sum of the weights);
2. Multiply each by the total (combined) sum of the four weights;
3. Multiply the Years 1-4 weight by 4/11, the Years 5&6 weight by 2/11, the Years 7&8 weight by 2/11 and the Years 9-11 weight by 3/11;
4. Combine the resulting weights into one variable;
5. Re-scale the combined weight to have a mean of 1.
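The re-scaling steps can be sketched as follows for a stacked dataset. This is one reading of the procedure, not an official implementation: the `wave` labels, the `wti` column name and the `rescale_weights` helper are hypothetical, the grand total is computed from the data rather than hard-coded, and the toy weights are invented.

```python
import pandas as pd

# Share of the 11 fieldwork years covered by each dataset (step 3).
YEAR_SHARE = {"Y1-4": 4 / 11, "Y5-6": 2 / 11, "Y7-8": 2 / 11, "Y9-11": 3 / 11}

def rescale_weights(df, weight_col="wti", wave_col="wave", mean_one=False):
    """Re-scale one weight across the four stacked datasets."""
    grand_total = df[weight_col].sum()                             # used in step 2
    wave_sum = df.groupby(wave_col)[weight_col].transform("sum")   # used in step 1
    share = df[wave_col].map(YEAR_SHARE)                           # used in step 3
    combined = df[weight_col] / wave_sum * grand_total * share     # steps 1-4
    if mean_one:
        combined = combined / combined.mean()                      # step 5
    return combined

# Invented weights; for a subgroup, filter the rows before calling the helper.
demo = pd.DataFrame({
    "wave": ["Y1-4", "Y1-4", "Y5-6", "Y7-8", "Y9-11", "Y9-11"],
    "wti": [1.0, 3.0, 2.0, 1.0, 0.5, 1.5],
})
demo["wti_comb"] = rescale_weights(demo)
demo["wti_sub"] = rescale_weights(demo, mean_one=True)
print(demo)
```

After re-scaling, each dataset contributes its year share of the total weight, so the relative weights within a dataset are preserved while the balance between datasets is corrected.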


## Individual Data

Contains data for all fully productive individuals, i.e. those who completed three or four food diary days. It contains information from the household questionnaire, main individual schedule, self-completions, physical measurements and nurse visit (where one occurred). It also includes blood sample results and spot iodine data.

In [8]:
ind = pd.read_csv("ndns_rp_yr9-11a_indiv_20211020.tab", sep = "\t", low_memory=False)
ind.head()



seriali,serialh,Area,region,GOR,Addnum,surveyyr,AdChild,Quarter,month,wti_Y911,wtb_Y911,wtn_Y911,wtr_Y911,wtsu_Y911,wti_Y9,wtb_Y9,wtn_Y9,wtr_Y9,wtsu_Y9


## Household Data

Contains data on household composition, sex, age and marital status for all individuals in co-operating households.

In [9]:
hh = pd.read_csv("ndns_rp_yr9-11a_hhold_20210831.tab", sep = "\t")
hh.head()



seriali,serialh,surveyyr,Outcome,IOut,ScrType,DMHSize7,Sex,AgeR,MarSt2_r,ethgrp2,RelHRPr,Rel01r,Rel02r,Rel03r,Rel04r,Rel05r,Rel06r,Rel07r,Country


## Food Level Diary Data

Includes nutrient data and disaggregation at food level. It also shows who else was present at the eating occasion, where the participant was located, whether the television was on and whether or not the participant was sitting at a table. Each row is an eating occasion for a study participant.

In [6]:
fdd = pd.read_csv("ndns_rp_yr11a_foodleveldietarydata_uk_20210831.tab", sep = "\t")
fdd.head()



seriali,SurveyYear,AgeR,Sex,CoreBoost,Country,DiaryDate,DayofWeek,DayNo,DiaryDaysCompleted,CannedTunag,Shellfishg,CottageCheeseg,CheddarCheeseg,OtherCheeseg,TotalGrams,WhoWith,Where,WatchingTV,Table


## Day Level Diary Data

Daily food consumption data calculated using recipe main food groups and recipe sub food groups data.

In [7]:
ddd = pd.read_csv("ndns_rp_yr9-11a_dayleveldietarydata_foods_uk_20210831.tab", sep = "\t")
ddd.head()



seriali,SurveyYear,DiaryDate,DayofWeek,AgeR,Sex,Country,BACONANDHAM,BEEFVEALANDDISHES,BEERLAGERCIDERPERRY,SAVOURYSAUCESPICKLESGRAVIESCONDIMENTS,SOUPHOMEMADEANDRETAIL,SpecialDiet,Supps,UsualFoodQuantity,LessFoodReason,MoreFoodReason,UsualDrinkQuantity,LessDrinkReason,MoreDrinkReason


## Person Level Diary Data

Mean intakes of nutrients, food consumption data calculated using recipe main food groups and recipe sub food groups data plus disaggregated food at the participant level. Also includes derived variables such as LRNI and RNI indicators and percentages.

In [11]:
pdd = pd.read_csv("ndns_rp_yr9-11a_personleveldietarydata_uk_20210831.tab", sep = "\t")
pdd.head()



seriali,SurveyYear,NDays,AgeR,Sex,Country,TotalEMJ,FoodEMJ,EnergykJ,FoodEkJ,FOLICACID_CAPI,IRONONLYORWITHVITAMINC_CAPI,VITC_CAPI,OTHERNUTRIENTSUPPLEMENTS_CAPI,VITAMINSTWOORMOREINCLMULTIVITSNOMINERALS_CAPI,VITAMINSANDMINERALSINCLMULTIVITSMINERALS_CAPI,NONNUTRIENTSUPPLEMENTSINCLHERBAL_CAPI,SINGLEVITAMINSMINERALS_CAPI,MULTIVITAMINANDORMINERALSWITHOMEGA3_CAPI,SuppTaker_CAPI


In [12]:
list(pdd)

['seriali',
 'SurveyYear',
 'NDays',
 'AgeR',
 'Sex',
 'Country',
 'TotalEMJ',
 'FoodEMJ',
 'EnergykJ',
 'FoodEkJ',
 'Energykcal',
 'FoodEkcal',
 'Proteing',
 'Fatg',
 'Saturatedfattyacidsg',
 'CisMonounsaturatedfattyacidsg',
 'Cisn6fattyacidsg',
 'Cisn3fattyacidsg',
 'Transfattyacidsg',
 'Carbohydrateg',
 'Totalsugarsg',
 'Othersugarsg',
 'Starchg',
 'Glucoseg',
 'Fructoseg',
 'Sucroseg',
 'Maltoseg',
 'Lactoseg',
 'Nonmilkextrinsicsugarsg',
 'Intrinsicandmilksugarsg',
 'Intrinsicandmilksugarsandstarchg',
 'Englystfibreg',
 'FreeSugarsg',
 'AOACFibreg',
 'Retinolµg',
 'Totalcaroteneµg',
 'Alphacaroteneµg',
 'Betacaroteneµg',
 'Betacryptoxanthinµg',
 'VitaminAretinolequivalentsµg',
 'VitaminDµg',
 'VitaminEmg',
 'Thiaminmg',
 'Riboflavinmg',
 'Niacinequivalentmg',
 'VitaminB6mg',
 'VitaminB12µg',
 'Folateµg',
 'Pantothenicacidmg',
 'Biotinµg',
 'VitaminCmg',
 'Sodiummg',
 'Potassiummg',
 'Calciummg',
 'Magnesiummg',
 'Phosphorusmg',
 'Ironmg',
 'Haemironmg',
 'Nonhaemironmg',
 'Copperm

## Nutrient Data Bank

In [86]:
ndb = pd.read_csv("ndns_yr11_nutrientdatabank_2021-03-19.tab", sep = "\t")
ndb.head()

Unnamed: 0,FoodNumber,FoodName,FoodCategory,Base,FoodGroupCode,Dilution,Units,Description,Comment,EdiblePortion,...,Poultry,ProcessedPoultry,GameBirds,WhiteFish,OilyFish,CannedTuna,Shellfish,CottageCheese,CheddarCheese,OtherCheese
0,1,ARROWROOT POWDER,F,100,1R,1,Grams,,,1,...,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,PEARL BARLEY WHITE DRIED,F,100,1R,1,Grams,,,1,...,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,BARLEY PEARL WHITE BOILED IN WATER,F,100,1R,1,Grams,,,1,...,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,BARLEY WHOLE GRAIN DRIED,F,100,1R,1,Grams,,,1,...,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,BARLEY WHOLEGRAIN BOILED IN WATER,F,100,1R,1,Grams,,,1,...,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
