This notebook contains the following sections:
1. GSCOST in 2017 NHTS
2. Read in 1995 veh file, show full ids, and quick preselection
3. Examples for how to use weights (using 2017 trip and veh files)
4. 1990 veh file - VEHYEAR

#### 1. GSCOST in 2017 NHTS

We confirmed with a colleague that the GSCOST label in the 2017 NHTS is incorrect. It should read "Annualized fuel cost in US **dollars** per equivalent gallon".

#### 2. Read in 1995 veh file, show full ids, and quick preselection

In [31]:
import pandas as pd
import numpy as np
# this is how to "turn off" the scientific notation
# basically it pre-specifies the *display* format for all cols containing long float numbers
# source: https://stackoverflow.com/questions/21137150/format-suppress-scientific-notation-from-python-pandas-aggregation-results
pd.set_option('display.float_format', lambda x: '%.3f' % x)
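
The display setting above can be tried on a small example first. The following is a minimal sketch with made-up values (`ids` and `fixed_three` are hypothetical names); it uses `pd.option_context` so the format is applied only inside the `with` block rather than globally.

```python
import pandas as pd

# made-up values: one long ID-like float and one small float
ids = pd.Series([30000007123.0, 0.5])

# formatter that always shows three decimals and never uses scientific notation
fixed_three = "{:.3f}".format

# apply the display format temporarily; it is restored when the block ends
with pd.option_context("display.float_format", fixed_three):
    shown = ids.to_string()

print(shown)
```

This only changes how the numbers are displayed; the stored values are untouched.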

In [12]:
# import 1995 veh file in .xpt format
# enter the path to the .xpt file for vehicle file of 1995 NPTS
path_to_veh1995 = r'E:\Demo\SAS_transport1995\Xpt\VEHICL95.EXP'
veh1995 = pd.read_sas(path_to_veh1995, format='xport', index=None, encoding="utf-8", chunksize=None, iterator=False)
# drop the last three rows
veh1995_1 = veh1995[:-3].copy()
# check the row count between the two datasets
print(len(veh1995_1), len(veh1995))

75217 75220


#### 3. Examples for how to use weights (using 2017 trip and veh files)

Please read section *7.11 Weighting the Data* in [2017 Users' Guide](https://nhts.ornl.gov/assets/NHTS2017_UsersGuide_04232019_1.pdf) first!

**Note:** The process of applying the weights is specific to Python. Check the SPSS and/or Stata documentation for instructions on how to weight and then summarize weighted data.
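
As a quick orientation before the real files are loaded, here is a minimal sketch of the idea on a made-up table. The column names follow the 2017 vehicle file, but the five rows are invented for illustration, and the weighted mean is the standard formula (sum of value times weight, divided by sum of weights) rather than something taken from the Users' Guide.

```python
import pandas as pd

# invented rows; -8 stands in for a negative missing-value code
toy = pd.DataFrame({
    "VEHAGE": [2, 7, 4, -8, 1],
    "WTHHFIN": [100.0, 250.0, 300.0, 120.0, 50.0],
})

# drop negative codes before any calculation
valid = toy[toy["VEHAGE"] >= 0]
newer = valid[valid["VEHAGE"] < 5]

# unweighted count = sample size (only used to spot small samples)
sample_size = len(newer)

# sum of weights = estimated population total
weighted_total = newer["WTHHFIN"].sum()

# weighted mean of a numeric column
weighted_mean_age = (valid["VEHAGE"] * valid["WTHHFIN"]).sum() / valid["WTHHFIN"].sum()

print(sample_size, weighted_total, round(weighted_mean_age, 3))
```

The count of rows and the sum of weights answer different questions: the first says how many records back the estimate, the second is the estimate itself.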

In [57]:
# import 2017 trip and veh files
path_to_veh2017 = r'E:\Demo\NHTS2017_csv\vehpub.csv'
veh17 = pd.read_csv(path_to_veh2017)
path_to_trip2017 = r'E:\Demo\NHTS2017_csv\trippub.csv'
trip17 = pd.read_csv(path_to_trip2017)

##### 3.1. How many vehicles are less than 5 years old (VEHAGE < 5)?

In [82]:
# first drop records with negative codes (-7 and -8) in VEHAGE
veh17_1 = veh17[veh17['VEHAGE'] >= 0].copy()
# select records with VEHAGE < 5:
veh17_2 = veh17_1[veh17_1['VEHAGE'] < 5]

# Count the records to get the *sample size*:
count = len(veh17_2)
print('Sample size - unweighted: {0:,d}'.format(count))

# Sum the weight to get the estimated national total
wt_sum = veh17_2['WTHHFIN'].sum()
print('Weighted sum: {0:,.2f}'.format(wt_sum))

print(' \nNote: Sample size is only used to check whether there are any cases with small sample size (<30)')

Sample size - unweighted: 67,882
Weighted sum: 57,159,981.65
 
Note: Sample size is only used to check whether there are any cases with small sample size (<30)


##### 3.2. Annualized VMT by trip mode (TRPTRANS) and trip purpose summary (WHYTRP1S)

For the filter conditions for vehicle trips (VT) and VMT, refer to sections *7.5 Vehicle Trips* and *7.6 Vehicle Miles of Travel (VMT)* in the 2017 Users' Guide.

In [63]:
# Select VT trips for VMT calculation
pov17 = [3, 4, 5, 6, 8, 9, 18]
trip17_vt = trip17[(trip17['TRPTRANS'].isin(pov17)) & (trip17['DRVR_FLG'] == 1) & (trip17['TRPMILAD'] > -1)].copy()
# apply the trip weight (WTTRDFIN) to the unweighted VT miles (TRPMILAD) to get TRPMILAD_WT
# column-wise multiplication gives the same result as a row-wise apply and is much faster
trip17_vt['TRPMILAD_WT'] = trip17_vt['TRPMILAD'] * trip17_vt['WTTRDFIN']

In [65]:
# quick preview for selected columns
trip17_vt[['HOUSEID', 'PERSONID', 'TDTRPNUM', 'TRPTRANS', 'TRPMILAD', 'WHYTRP1S', 'WTTRDFIN', 'TRPMILAD_WT']].head()

Unnamed: 0,HOUSEID,PERSONID,TDTRPNUM,TRPTRANS,TRPMILAD,WHYTRP1S,WTTRDFIN,TRPMILAD_WT
0,30000007,1,1,3,5.848,20,75441.906,441152.911
1,30000007,1,2,3,5.742,1,75441.906,433161.011
2,30000007,2,1,6,90.178,1,71932.646,6486763.282
3,30000007,2,2,6,87.628,10,71932.646,6303289.286
4,30000007,3,1,3,2.509,20,80122.687,201025.818


In [66]:
# create cross-tabulation for weighted annual sum:
pd.crosstab(trip17_vt['TRPTRANS'], trip17_vt['WHYTRP1S'],
            values=trip17_vt['TRPMILAD_WT'], aggfunc='sum',
            dropna=False,
            margins=True
)

TRPTRANS \ WHYTRP1S,1,10,20,30,40,50,70,80,97,All
3,443554282026.511,253614376757.539,37849056779.98,18858138721.617,172558182187.266,135796454052.247,73956156516.144,59418503111.496,19421168549.677,1215026318702.476
4,202739048161.186,105063583803.315,19541180254.647,10346720784.702,91612356809.238,75254923813.654,45735220285.926,37513955564.111,9102905863.07,596909895339.845
5,50081960048.91,28049015999.274,3677710736.265,3123598580.384,26845200028.902,15690725851.191,16314502012.774,11001780017.514,4210026347.764,158994519622.978
6,119476610372.193,83869363215.569,5958399843.403,4954018961.88,44096555997.956,30610851083.853,14573026597.997,13837260059.023,5834520689.129,323210606821.007
8,4297674578.827,1433614089.403,197832430.726,28841918.067,1919588399.045,1361886315.167,94239282.699,772735569.077,241826921.714,10348239504.726
9,339649819.169,153220814.044,1589637065.552,3616103.167,397383637.828,157735736.428,51715731.328,75452411.156,35306250.716,2803717569.388
18,3662521643.784,2467691121.226,124191754.138,18761190.706,1933573490.198,2656884833.172,813331591.65,1288476346.355,1561006169.633,14526438140.862
All,824151746650.569,474650865800.371,68938008864.712,37333696260.523,339362840550.431,261529461685.709,151538192018.516,123908163078.732,40406760791.703,2321819735701.377
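
To make clear what the cross-tabulation computes, the sketch below builds the same kind of table on invented rows with a plain `groupby` followed by `unstack`. The column names match the trip file; the values and the names `via_groupby` and `via_crosstab` are made up for illustration.

```python
import pandas as pd

# invented trips: two modes, two purposes
toy = pd.DataFrame({
    "TRPTRANS": [3, 3, 4, 4, 3],
    "WHYTRP1S": [1, 10, 1, 10, 1],
    "TRPMILAD": [5.0, 2.0, 10.0, 1.0, 3.0],
    "WTTRDFIN": [100.0, 200.0, 50.0, 400.0, 100.0],
})
toy["TRPMILAD_WT"] = toy["TRPMILAD"] * toy["WTTRDFIN"]

# sum the weighted miles within each (mode, purpose) cell, then pivot purpose into columns
via_groupby = toy.groupby(["TRPTRANS", "WHYTRP1S"])["TRPMILAD_WT"].sum().unstack()

# the same table through crosstab (without the margins)
via_crosstab = pd.crosstab(toy["TRPTRANS"], toy["WHYTRP1S"],
                           values=toy["TRPMILAD_WT"], aggfunc="sum")

print(via_groupby)
```

Each cell is simply the sum of miles times weight over the trips that fall in that mode and purpose.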


In [67]:
# create cross-tabulation for sample size:
# always check the sample size - any cell below 30 is considered a small sample
pd.crosstab(trip17_vt['TRPTRANS'], trip17_vt['WHYTRP1S'],
            values=trip17_vt['TRPMILAD_WT'], aggfunc=lambda x: x.count(),
            dropna=False,
            margins=True
)

TRPTRANS \ WHYTRP1S,1,10,20,30,40,50,70,80,97,All
3,104944.0,46346.0,9108.0,6004.0,67997.0,27424.0,20730.0,21587.0,4955.0,309095.0
4,57056.0,23437.0,4296.0,3380.0,38034.0,14999.0,14734.0,12829.0,2606.0,171371.0
5,12366.0,4942.0,1178.0,809.0,8124.0,2989.0,5018.0,2565.0,765.0,38756.0
6,30151.0,17024.0,1624.0,1340.0,19223.0,6902.0,4407.0,6007.0,1539.0,88217.0
8,710.0,333.0,41.0,24.0,349.0,262.0,18.0,143.0,50.0,1930.0
9,202.0,69.0,22.0,5.0,126.0,90.0,25.0,28.0,33.0,600.0
18,314.0,241.0,19.0,6.0,195.0,211.0,59.0,188.0,97.0,1330.0
All,205743.0,92392.0,16288.0,11568.0,134048.0,52877.0,44991.0,43347.0,10045.0,611299.0
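
Scanning a large table by eye for cells below 30 is error-prone, so the check can be automated. The following is a minimal sketch on invented records; `MIN_CELL`, `counts`, and `small` are hypothetical names, and the threshold of 30 is the rule of thumb stated above.

```python
import pandas as pd

# invented records: 35 in cell (8, 1), 5 in cell (8, 30), 5 in cell (9, 1)
toy = pd.DataFrame({
    "TRPTRANS": [8] * 40 + [9] * 5,
    "WHYTRP1S": [1] * 35 + [30] * 5 + [1] * 5,
})

MIN_CELL = 30  # assumed threshold for a "small" sample

# unweighted counts per (mode, purpose) cell; empty cells show up as 0
counts = pd.crosstab(toy["TRPTRANS"], toy["WHYTRP1S"])

# flatten to one row per cell, indexed by (WHYTRP1S, TRPTRANS), and keep the small ones
flat = counts.unstack()
small = flat[flat < MIN_CELL]

print(small)
```

Weighted estimates for the cells listed in `small` should be treated with caution or reported with a note.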


#### 4. 1990 veh file - VEHYEAR 

In [13]:
# import 1990 veh file
path_to_veh1990 = r'E:\Demo\NPTS1990_xpt\Vehicle.xpt'
veh1990 = pd.read_sas(path_to_veh1990, format='xport', index=None, encoding="utf-8", chunksize=None, iterator=False)
len(veh1990)

41178

In [51]:
veh1990.tail()

Unnamed: 0,VEHYEAR,VEHHHOWN,VEHOWNER,VEH12MNT,VEHNEW,MAINDRVR,WHOMAIN,VEHID,VOWNFLG,OVOWNFLG,...,MSTR_MON,MSTR_YR,CENSUS_D,CENSUS_R,HHFAMINC,POVERTY,HHLOC,URBNAREA,HHSIZE,POPDNSTY
41173,89.0,1.0,94.0,2.0,1.0,1.0,2.0,3.0,,,...,8.0,90.0,5.0,3.0,99.0,99.0,2.0,3.0,4.0,2.0
41174,87.0,1.0,94.0,2.0,1.0,1.0,4.0,4.0,,,...,8.0,90.0,5.0,3.0,99.0,99.0,2.0,3.0,4.0,2.0
41175,71.0,1.0,94.0,1.0,2.0,1.0,1.0,1.0,,,...,7.0,90.0,5.0,3.0,1.0,1.0,3.0,3.0,3.0,3.0
41176,91.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,,,...,3.0,91.0,5.0,3.0,13.0,3.0,2.0,2.0,4.0,4.0
41177,84.0,1.0,94.0,2.0,1.0,1.0,2.0,2.0,,,...,3.0,91.0,5.0,3.0,13.0,3.0,2.0,2.0,4.0,4.0


In [47]:
# always check the min and max of a column before recoding it
print(veh1990['VEHYEAR'].min(), veh1990['VEHYEAR'].max())
# the max value is 999.0, which cannot be a two-digit model year,
# so check the 1990 Users' Guide for the full list of categories.

55.0 999.0


On PDF p. 106 (hardcopy page C-26; the file is not searchable) of the [1990 Users' Guide](https://nhts.ornl.gov/1990/doc/1990UsersGuide.pdf), the VEHYEAR categories are listed as follows:

- 055 = 1919-1959
- 063 = 1960-1964
- 065-091 = 19__ (year)
- 994 = Legitimate Skip
- 998 = Not Ascertained
- 999 = Refused

Note: Due to low incidence, years associated with vintage household vehicles are aggregated into one of two categories (055 and 063). These should be excluded from average vehicle age calculations.   
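
One way to follow this note is to keep only the single-year codes (065 to 091) before averaging. The sketch below uses an invented series of codes; `single_year` and `model_year` are hypothetical names, and it stops at the mean model year because converting to vehicle age would require assuming a reference year.

```python
import pandas as pd

# invented codes: two vintage categories, two single years, two missing-value codes
vehyear = pd.Series([55.0, 63.0, 84.0, 90.0, 994.0, 999.0], name="VEHYEAR")

# keep codes 65 through 91 only; this drops 055, 063, 994, 998 and 999
single_year = vehyear[vehyear.between(65, 91)]

# convert the two-digit code to a four-digit model year
model_year = (1900 + single_year).astype(int)

mean_model_year = model_year.mean()
print(model_year.tolist(), mean_model_year)
```

With real data the mean would normally also be weighted; that step is left out here to keep the filter itself in focus.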

In [83]:
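# range(55, 92) covers the integers 55 through 91 (the upper bound is exclusive),
# so this keeps the valid codes and drops 994, 998 and 999.
# note: the aggregated vintage codes 55 and 63 are still included here;
# exclude them before calculating average vehicle age.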
veh1990_sel = veh1990[veh1990['VEHYEAR'].isin(range(55,92))].copy()
# quick check the min and max
print(veh1990_sel['VEHYEAR'].min(), veh1990_sel['VEHYEAR'].max())

55.0 91.0


In [84]:
# two ways to create a 4-digit year column from VEHYEAR
# method-a: string concatenation
# 1. convert VEHYEAR from float to integer
# 2. convert the integer to a string
# 3. add '19' to the front
# 4. convert the whole string back to an integer
veh1990_sel['VEHYEAR_4_a'] = veh1990_sel.apply(lambda x: int('19' + str(int(x['VEHYEAR']))), axis=1)

# quick check the first 5 rows
veh1990_sel[['VEHYEAR','VEHYEAR_4_a']].head()

Unnamed: 0,VEHYEAR,VEHYEAR_4_a
0,84.0,1984
1,85.0,1985
2,86.0,1986
3,90.0,1990
4,90.0,1990


In [85]:
# quick check the min and max
print(veh1990_sel['VEHYEAR_4_a'].min(), veh1990_sel['VEHYEAR_4_a'].max())

1955 1991


In [87]:
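# method-b
# another way (as you have tried): add 1900.0 to VEHYEAR directly
# why your attempt may not have worked:
# mixing integers and floats is not the problem, since Python and pandas allow arithmetic between them
# (1900 + VEHYEAR gives the same result as 1900.0 + VEHYEAR);
# a more likely cause is adding the string '19' to a float, which raises a TypeError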
veh1990_sel['VEHYEAR_4_b'] = veh1990_sel.apply(lambda x: 1900.0 + x['VEHYEAR'], axis=1)
# quick check the first 5 rows
veh1990_sel[['VEHYEAR', 'VEHYEAR_4_b']].head()

Unnamed: 0,VEHYEAR,VEHYEAR_4_b
0,84.0,1984.0
1,85.0,1985.0
2,86.0,1986.0
3,90.0,1990.0
4,90.0,1990.0


In [88]:
# quick check the min and max
print(veh1990_sel['VEHYEAR_4_b'].min(), veh1990_sel['VEHYEAR_4_b'].max())

1955.0 1991.0
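
Both methods can also be written without a row-wise `apply`. The following is a minimal sketch on an invented three-row table (`as_float` and `string_plus_float_ok` are hypothetical names); it also shows that integer plus float arithmetic is accepted, while string plus float is not.

```python
import pandas as pd

toy = pd.DataFrame({"VEHYEAR": [84.0, 85.0, 91.0]})

# integer + float arithmetic is valid; the result is a float column
as_float = 1900 + toy["VEHYEAR"]

# cast to int to get whole years such as 1984 instead of 1984.0
toy["VEHYEAR_4"] = as_float.astype(int)

# by contrast, a string plus a float is rejected by Python
try:
    "19" + 84.0
    string_plus_float_ok = True
except TypeError:
    string_plus_float_ok = False

print(toy)
print(string_plus_float_ok)
```

The cast to `int` assumes the column has no missing values, which holds after the selection on valid codes.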
