Matthieu Gomez edited this page Apr 28, 2016 · 36 revisions

Stata Pitfalls

A list of common situations where Stata returns something different from what you may have in mind.

Feel free to share your own stories using the "Edit" button at the top right corner (don't forget to add your name in the list of contributors).

Missing values

Sort

  • Missing observations for numeric variables are sorted last. - In particular, the following does not always return the maximum time by id: bysort id (time): gen temp = time[_N]

      The variable temp can be missing if time is missing in some id. Instead, use
    
      ```
      egen temp = max(time), by(id)
      ```
    
  • In contrast, missing observations for string variables are sorted first:

     . clear
     . input str5 strvar
     . a
     . b
     . "" 
     . g
     . c
     . end
     . sort strvar
     . list
    
          +--------+
          | strvar |
          |--------|
       1. |        |
       2. |      a |
       3. |      b |
       4. |      c |
       5. |      g |
          +--------+
    

Arithmetic operations

  • Explicit arithmetic operations containing missing values return missing values

     . clear
     . set obs 1
     . gen a = 1
     . gen b = .
     . gen sum = a + b
     . gen product = a * b
     list
        +-----------------------+
        | a   b   sum   product |
        |-----------------------|
     1. | 1   .     .         . |
        +-----------------------+
    
  • In contrast, functions such as max and min systematically omit missing values

    . clear
    . set obs 1
    . gen a = 1
    . gen b = .
    . gen max = max(a, b)
    . gen min = min(a, b)
    . list
    +---------------------+
    | a   b   max    min  |
    |---------------------|
    | 1   .    1      1   |
    +---------------------+
    
  • collapse (sum) and egen sum, and rowsum transform missing values to zero.

     - In particular, when all observations are missing within a group, the sum of a variable is *zero*, not a missing value. 	The issue is discussed on the Statalist [here](http://www.stata.com/statalist/archive/2004-07/msg00779.html), [here](http://www.stata.com/statalist/archive/2007-10/msg00806.html), [here](http://www.stata.com/statalist/archive/2010-02/msg00422.html)
    
    
    ```
    . clear
    . set obs 1
    . gen a = .
    . collapse (sum) sum = a (mean) mean = a (sd) sd = a
    . list
       +-----------------+
       | sum   mean   sd |
       |-----------------|
    1. |   0      .    . |
       +-----------------+
    ```
    In contrast, the functions `mean` or `sd` return missing values : internally, these commands evaluate an expression of the form `0/0` (the denominator corresponds to the number of non missing observations) which evaluates to missing in Stata. To return a missing value when all observations are missing, you can use a temporary variable to count non-missing observations.
    
    ```
    . clear
    . set obs 1
    . gen a = .
    . gen nonmissing = !missing(a)
    . collapse (sum) sum = a  anynonmissing = nonmissing
    . replace a = . if anynonmissing == 0
    . list
    
        +--------------+
        | a   nonmis~g |
        |--------------|
     1. | .          0 |
        +--------------+
    ```
    

Booleans

Stata considers any numeric value different from zero to be true. In particular, missing values are true. For instance, in the following command, the last condition evaluates to true for all observations

. clear
. set obs 10
. gen condition = 1 if _n >= 5
. keep if condition
(0 observations deleted)

To avoid this issue, create boolean variable alternatively equal 1 and 0, rather than 1 and missing.

. clear
. set obs 10
. gen condition = _n >= 5
. keep if condition
(4 observations deleted)

Egen and collapse

  • collapse can give unexpected results when using weights. Observations with missing or zero weights are removed from all the computations, even those in (rawsum)

     . clear
     . set obs 2
     . gen a  = 1
     . gen w = 1 if _n == 1
     . collapse (rawsum) a [w = w]
        +---+
        | a |
        |---|
     1. | 1 |
        +---+
    

    The topic is discussed on the Statalist here, and here

  • collapse (sum) [aw] (the default) and collapse (sum) [pw] returns\sum w_i v_i/\sum w_i * _N

    . clear
    . set obs 2
    . gen a  = _n
    . gen b = 1
    . gen w = _n
    . collapse (sum) a [aw = w]
    

    If you want collapse(sum) to return \sum w_i v_i, use iweight or fweight

    . clear
    . set obs 2
    . gen a  = _n
    . gen b = 1
    . gen w = _n
    . collapse (sum) a [iw = w]
    
  • Be careful when running regressions on collapsed groups, or when constructing leave-one-out regressors. As an example, the following code generates y and x such that y = 2*x + runiform(), then set x to missing every 13 rows, and y to missing every 11 rows.

     . clear
     . set seed 10
     . set obs 10000
     . gen a = mod(_n, 1000)
     . gen x = uniform()
     . gen y = 2 * x + uniform()
     . replace y = . if mod(_n, 11) == 0
     . replace x = . if mod(_n, 13) == 0
     . collapse (mean) x y, by (a)
    

    Regressing y on x yields a biased estimate for b:

     . reg y x
           Source |       SS       df       MS              Number of obs =    1000
     -------------+------------------------------           F(  1,   998) = 2330.48
            Model |  32.6328302     1  32.6328302           Prob > F      =  0.0000
         Residual |  13.9746306   998  .014002636           R-squared     =  0.7002
     -------------+------------------------------           Adj R-squared =  0.6999
            Total |  46.6074608   999  .046654115           Root MSE      =  .11833
     
     ------------------------------------------------------------------------------
                y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
     -------------+----------------------------------------------------------------
                x |   1.864172   .0386157    48.28   0.000     1.788395    1.939949
            _cons |   .5655158   .0195333    28.95   0.000     .5271848    .6038469
     ------------------------------------------------------------------------------
    

    To avoid this situation, first flag/remove all missing observations where any of the variable is missing

     . clear
     . set seed 10
     . set obs 10000
     . gen a = mod(_n, 1000)
     . gen x = uniform()
     . gen y = 2 * x + uniform()
     . replace y = . if mod(_n, 11) == 0
     . replace x = . if mod(_n, 13) == 0
     . keep if !missing(x) & !missing(y)
     . collapse (mean) x y, by (a)
     . reg y x
     
           Source |       SS       df       MS              Number of obs =    1000
     -------------+------------------------------           F(  1,   998) = 4291.64
            Model |  42.2154102     1  42.2154102           Prob > F      =  0.0000
         Residual |  9.81699173   998  .009836665           R-squared     =  0.8113
     -------------+------------------------------           Adj R-squared =  0.8111
            Total |  52.0324019   999  .052084486           Root MSE      =  .09918
     
     ------------------------------------------------------------------------------
                y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
     -------------+----------------------------------------------------------------
                x |   2.002737   .0305712    65.51   0.000     1.942746    2.062728
            _cons |   .4996378   .0154767    32.28   0.000     .4692672    .5300085
     ------------------------------------------------------------------------------
    

Merge

  • merge m:m does not create a dataset with all the possible combinations between the master and the using dataset. To do a "true" m:m join (i.e. a SQL Outer join or in R, a default merge), use joinby. The issue is discussed here.

  • When using the option update, the variable _merge takes values from 1 to 5

     not matched                                 
             from master                           (_merge==1)
             from using                            (_merge==2)
     matched                                     
             not updated                           (_merge==3)
             missing updated                       (_merge==4)
             nonmissing conflict                   (_merge==5)
    
    

    In particular, usual follow up commands such as keep if _merge == 3 takes a different meaning with the option update. To avoid this problem, directly use the keep option in merge

     . merge 1:1 using usingfile, update keep(matched) 
    

    This returns the same result as keep if inrange(_merge, 3, 5)

Panel Data

  • To lag a variable v in an unbalanced panel, use the operators L.v and F.v (which return the variable value at the past/future period) rather than v[_n-1] and v[_n+1] (which return the variable value at the previous/next row)

Floats

Variables in Stata are stored in float (called "single precision" or Float32 in other languages) by default, which creates some specific issues.

Float vs integer

Floats can store integers accurately only up to 2^24 = 16,777,216. In other words, integers higher than 2^24 are rounded up or down. This makes float variables problematic to use as id variables when some id numbers are higher than 15 million (either because you have more than 16 million unique groups, or because you use EIN or SSN as identifiers). In these cases, be sure to store these id variables as long (up to 2 billions) or double rather than float. Below are three common situations:

  • Creating an identifying variable

     . clear
     . set obs 20000000
     . gen id = _n
    

    The id variable is, by default, a float. Therefore, it does not identify each row.

     . distinct id
             |        Observations
             |      total   distinct
      -------+----------------------
          id |   2.00e+07   1.84e+07
    

    Rather, use long in the generate command

     . clear
     . set obs 20000000
     . gen long id = _n
     . distinct id
            |        Observations
            |      total   distinct
     -------+----------------------
         id |   2.00e+07   2.00e+07
    

    The same issue can be encountered when working with datetimes since datetimes are stored as the number of milliseconds since 01/01/1960)

      . clear
      . set obs 1
      . gen datetime = mdyhms(01, 01, 2001, 01, 00, 0)
      .  display %tc datetime[1]
       01jan2001 00:58:58
    
  • Converting a string to an integer, real() converts the string variable to a float by default. Rather, use

     . gen long id_int = real(id)
     ```
     or
     ```
     . destring id
     ```
     which automatically creates a double
    
    
    
  • Importing a csv file. If the csv file contains integers that can't be stored in a float type without loss of accuracy, force all variables to be imported as double

     . import delimited filename, asdouble
    

    Once the dataset is imported, you may convert the non-id variables from double to float using recast

Float vs double

Floats should not be used in equality conditions

. clear
. set obs 1
. gen a = 1.1
. display a[1] == 1.1
0

While a is a stored as a float, 1.1 is a double, and therefore they are not exactly equal. A solution is to apply the float function to every non-float expression

. display a[1] == float(1.1)
1

A better solution is probably to avoid comparing floating numbers to other floating numbers (i.e. convert these variables to strings or as long integers with a value label).

Types issues are discusses here

Strings

  • Regex: Stata does not support positive / negative look aheads / behind, nor counting operators (using braces).

Contributors

This list was made with contributions from:

  • Roberto Ferrer @refp16
  • Joe Canner
Clone this wiki locally
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.
Press h to open a hovercard with more details.