## The dataset and codebook can be found here: https://www.datafiles.samhsa.gov/study-dataset/national-survey-drug-use-and-health-2016-nsduh-2016-ds0001-nid17185
## In this notebook I find the column index of each survey question I want to keep in my dataframe, then pull those columns for every row (each row is a participant). The dataset is too big to load in full first; doing so crashes the notebook.

Start with necessary imports!

In [12]:
import pandas as pd
import numpy as np
import csv

I used the codebook to find the code for each question I wanted to include, then used that code to find the question's column index in the data

In [14]:
def indexfinder(code):
    # Read only the header row and return the zero-based position of the question code
    with open('NSDUH_2016_Tab.tsv') as tsvfile:
        reader = csv.reader(tsvfile, delimiter='\t')
        header = next(reader)
        return header.index(code)

So for example, the question with the code AMDELT is at column index 2074 (counting from zero), and I knew to keep that column

In [15]:
indexfinder("AMDELT")

2074
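Looking up one code at a time reopens the file on every call. As a rough sketch, the same lookup can be done for several codes with a single read of the header row. The function name `find_indices` and its `path` parameter are hypothetical and only for illustration; they are not part of the original notebook.

```python
import csv

def find_indices(codes, path='NSDUH_2016_Tab.tsv'):
    """Return {code: zero-based column index} for each requested code.

    Hypothetical helper: reads only the header row of the tab-separated file.
    """
    with open(path, newline='') as tsvfile:
        header = next(csv.reader(tsvfile, delimiter='\t'))

    # Record the first position of each column name, matching list.index()
    positions = {}
    for i, name in enumerate(header):
        positions.setdefault(name, i)

    missing = [c for c in codes if c not in positions]
    if missing:
        raise KeyError(f"codes not found in header: {missing}")
    return {c: positions[c] for c in codes}
```

With the file used here, `find_indices(["AMDELT"])` should agree with the `indexfinder("AMDELT")` result shown above.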

This reads all rows (participants) for the columns/questions of interest in my model

In [16]:
# Column indices (found with indexfinder) for the questions of interest
cols = [2074, 2075, 1693, 1718, 1824, 1826, 2503, 2515, 2517,
        2582, 2603, 1973, 2654, 2526, 2514, 2609, 2525, 2539]
ls = []
with open('NSDUH_2016_Tab.tsv') as tsvfile:
    reader = csv.reader(tsvfile, delimiter='\t')
    for row in reader:
        # Keep only the selected columns; the first row appended is the header
        ls.append([row[i] for i in cols])
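A possible alternative to the index-based loop, sketched under the assumption that the header names match the codebook codes: `pandas.read_csv` accepts a `usecols` argument, so only the named columns are parsed and the index lookup is not needed. The function name `load_columns` and its `path` parameter are hypothetical. `dtype=str` and `keep_default_na=False` are chosen here so values stay as raw text, as they do with `csv.reader`.

```python
import pandas as pd

def load_columns(codes, path='NSDUH_2016_Tab.tsv'):
    """Load only the named columns from a tab-separated file (hypothetical helper)."""
    df = pd.read_csv(
        path,
        sep='\t',
        usecols=codes,          # parse only these columns
        dtype=str,              # keep survey codes as text
        keep_default_na=False,  # leave empty fields as '' instead of NaN
    )
    # usecols returns columns in file order; reorder to match the request
    return df[codes]
```

For example, `load_columns(["AMDELT"])` would return a one-column dataframe for that question; the remaining codes would be added to the list in the same way.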

Light reformatting of the data as a dataframe

In [17]:
# The first collected row holds the question codes; use it as the column names
df = pd.DataFrame(ls[1:], columns=ls[0])

And then we save the dataframe!

In [18]:
df.to_csv('MH_DATA_UNCHANGED.csv', index=False, header=True)