# Navigation

Start Stata from Python with pystata; initialising it prints Stata's startup banner

In [1]:
import os
os.chdir("/Applications/Stata/utilities")
from pystata import config
config.init("se")


  ___  ____  ____  ____  ____ ®
 /__    /   ____/   /   ____/      StataNow 18.5
___/   /   /___/   /   /___/       SE—Standard Edition

 Statistics and Data Science       Copyright 1985-2023 StataCorp LLC
                                   StataCorp
                                   4905 Lakeway Drive
                                   College Station, Texas 77845 USA
                                   800-782-8272        https://www.stata.com
                                   979-696-4600        service@stata.com

Stata license: Unlimited-user network, expiring  9 Sep 2025
Serial number: 501809305305
  Licensed to: Mujie
               

Notes:
      1. Unicode is supported; see help unicode_advice.
      2. Maximum number of variables is set to 5,000 but can be increased;
          see help set_maxvar.


## Calculations without a dataset
Stata evaluates simple arithmetic directly with "di" (short for "display"); no dataset needs to be loaded

In [2]:
%%stata
di 5 * 2
di 10 / 2


. di 5 * 2
10

. di 10 / 2
5

. 


## Opening a .dta dataset

Open your own .dta file in Stata using "use FILE.dta"; a csv file is read with "import delimited FILE.csv" instead<br>
Stata also provides built-in and online example datasets, loaded with "sysuse FILE.dta" or "webuse FILE.dta" respectively

In [3]:
%%stata
sysuse auto, clear
sum mpg


. sysuse auto, clear
(1978 automobile data)

. sum mpg

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         mpg |         74     21.2973    5.785503         12         41

. 


In [4]:
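import io

import pandas as pd
import requests

# Read a local .dta file into a pandas DataFrame
babies_path = '/Users/mujiechen/Jupyter-Notebook/STATA/Datasets/babies.dta'
babies = pd.read_stata(babies_path)
print(babies.head())

# Download a csv file and read it into a pandas DataFrame
response = requests.get("https://www.stata.com/python/pystata18/misc/nhanes2.csv")
response.raise_for_status()  # stop early if the download failed
nhanes2 = pd.read_csv(io.StringIO(response.content.decode("utf-8")))
nhanes2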

   id   bwt  gest  mat_age  cigs
0   1  2100    32       41     0
1   2  3600    33       15     0
2   3  2360    33       18     0
3   4  2466    33       16     0
4   5  3840    34       28     0


Unnamed: 0,sampl,strata,psu,region,smsa,location,houssiz,sex,race,age,...,region4,smsa1,smsa2,smsa3,rural,loglead,agegrp,highlead,bmi,highbp
0,1400,1,1,S,2,1,4,Male,White,54,...,0,0,1,0,0,,50-59,,20.495686,0
1,1401,1,1,S,2,1,6,Female,White,41,...,0,0,1,0,0,2.564949,40-49,lead<25,21.022337,0
2,1402,1,1,S,1,1,6,Female,Other,21,...,0,1,0,0,0,,20-29,,24.973860,0
3,1404,1,1,S,2,1,9,Female,White,63,...,0,0,1,0,0,,60-69,,35.728722,1
4,1405,1,1,S,1,1,3,Female,White,64,...,0,1,0,0,0,2.995732,60-69,lead<25,27.923803,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10346,48760,32,2,MW,4,48,5,Female,White,35,...,0,0,0,1,1,,30-39,,20.355173,0
10347,48763,32,2,MW,4,48,2,Female,White,33,...,0,0,0,1,1,1.945910,30-39,lead<25,41.645557,1
10348,48764,32,2,MW,4,48,1,Female,White,60,...,0,0,0,1,0,,60-69,,35.626114,0
10349,48768,32,2,MW,4,48,1,Female,White,29,...,0,0,0,1,0,,20-29,,19.204464,0


## Display All Data
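Use "br" to browse the whole dataset or "br VARIABLE" to browse selected variables; "br" opens the Data Browser window, whereas "list" prints the observations to the output<br>
The "-d babies" option of the %%stata magic loads the pandas DataFrame babies into Stata as the current dataset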

In [5]:
%%stata -d babies

// Redacted as runtime of "list _all" is too long


. 
. // Redacted as runtime of "list _all" is too long
. 


## One Way Table Showing Frequency of Variable

In [6]:
%%stata

tab bwt if bwt < 2000


. 
. tab bwt if bwt < 2000

        bwt |      Freq.     Percent        Cum.
------------+-----------------------------------
       1503 |          1       11.11       11.11
       1701 |          1       11.11       22.22
       1800 |          1       11.11       33.33
       1814 |          1       11.11       44.44
       1910 |          2       22.22       66.67
       1920 |          1       11.11       77.78
       1956 |          1       11.11       88.89
       1984 |          1       11.11      100.00
------------+-----------------------------------
      Total |          9      100.00

. 
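
The Percent and Cum. columns above can be reproduced by hand. A minimal sketch in plain Python, using the nine birth weights listed in the table; the function name one_way_table is an illustrative choice, not part of Stata or pystata:

```python
from collections import Counter

# The nine birth weights under 2000 g shown in the table above.
bwt_under_2000 = [1503, 1701, 1800, 1814, 1910, 1910, 1920, 1956, 1984]


def one_way_table(values):
    """Return (value, freq, percent, cumulative percent) rows, sorted by value."""
    counts = Counter(values)
    total = len(values)
    rows, running = [], 0
    for value in sorted(counts):
        running += counts[value]  # cumulative frequency so far
        rows.append((
            value,
            counts[value],
            round(100 * counts[value] / total, 2),
            round(100 * running / total, 2),
        ))
    return rows


table = one_way_table(bwt_under_2000)
for row in table:
    print(row)
```

Each percentage is the frequency divided by the total number of observations kept by the "if" qualifier (9 here), not by the size of the full dataset.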


## Two Way Table Showing Frequency of Two Variables
"tab ROWVAR COLVAR" cross-tabulates two categorical variables: the first variable defines the rows and the second defines the columns<br>
The ", row" option adds within-row percentages beneath each frequency, so every row sums to 100%

In [7]:
%%stata

tab bwt gest if bwt < 2000, row


. 
. tab bwt gest if bwt < 2000, row

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

           |                          gest
       bwt |        35         36         37         38         39 |     Total
-----------+-------------------------------------------------------+----------
      1503 |         1          0          0          0          0 |         1 
           |    100.00       0.00       0.00       0.00       0.00 |    100.00 
-----------+-------------------------------------------------------+----------
      1701 |         0          0          0          1          0 |         1 
           |      0.00       0.00       0.00     100.00       0.00 |    100.00 
-----------+-------------------------------------------------------+----------
      1800 |         0          0          0          0          0 |         1 
           |      0.00       0.00       0.00       0.00       0.00 |    100.00 
-------
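
How a row percentage is computed can be sketched in plain Python. The category pairs below are hypothetical and are not taken from the babies dataset; row_percent is an illustrative helper name:

```python
from collections import Counter

# Hypothetical (row category, column category) pairs, for illustration only.
pairs = [
    ("low", "preterm"), ("low", "preterm"), ("low", "term"),
    ("high", "term"), ("high", "term"), ("high", "term"), ("high", "preterm"),
]

cell_counts = Counter(pairs)                      # frequency of each cell
row_totals = Counter(row for row, _ in pairs)     # frequency of each row


def row_percent(row, col):
    """Share of the row's observations that fall in the given column, in %."""
    return 100 * cell_counts[(row, col)] / row_totals[row]


for row in ("low", "high"):
    for col in ("preterm", "term"):
        print(row, col, cell_counts[(row, col)], round(row_percent(row, col), 2))
```

The denominator is the row total, which is why every row of the table sums to 100 regardless of how many observations it holds.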

## Reveal Structure and Properties of Dataset (à la .schema in SQLite)
Each variable has:
1. Variable name, storage type (int, byte, long, float, double etc.)
2. Display format
3. Value label (key-value pairs that store numerical values but display text)
4. Variable label (a longer description of the variable e.g. specifying units)

In [8]:
%%stata
describe


Contains data
 Observations:           997                  
    Variables:             5                  
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
id              int     %8.0g                 
bwt             int     %8.0g                 
gest            int     %8.0g                 
mat_age         int     %8.0g                 
cigs            int     %8.0g                 
-------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.


## Show any Notes attached to the Dataset

In [9]:
%%stata
notes

## Show Variable Name, Label, and some Summary Statistics
Shows whether the values are all unique, contain blanks, or are missing ("." for system missing, ".a" to ".z" for extended missing values)<br>
If every value is unique, the variable can possibly be used as an identifier<br>
If the variable is an <i>indicator</i> variable with a value label, this command will reveal the key-value pairs<br>
Indicator variables use less memory and can be worked into statistical models

In [10]:
%%stata
codebook bwt


-------------------------------------------------------------------------------
bwt                                                                 (unlabeled)
-------------------------------------------------------------------------------

                  Type: Numeric (int)

                 Range: [1503,4875]                   Units: 1
         Unique values: 319                       Missing .: 0/997

                  Mean: 3305.02
             Std. dev.: 505.138

           Percentiles:     10%       25%       50%       75%       90%
                           2660      3000      3300      3629      3940
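
The identifier check described above amounts to "no missing values and as many distinct values as observations". A minimal plain-Python sketch, with None standing in for a missing value and the first five id and cigs values from the head() output as toy input; both function names are illustrative:

```python
def summarise(values):
    """Count observations, distinct non-missing values, and missing entries (None)."""
    present = [v for v in values if v is not None]
    return {
        "n": len(values),
        "unique": len(set(present)),
        "missing": len(values) - len(present),
    }


def could_be_identifier(values):
    """True when nothing is missing and every observation has its own value."""
    s = summarise(values)
    return s["missing"] == 0 and s["unique"] == s["n"]


ids = [1, 2, 3, 4, 5]    # first five id values printed by head()
cigs = [0, 0, 0, 0, 0]   # first five cigs values printed by head()

print(could_be_identifier(ids), could_be_identifier(cigs))
```

For bwt the codebook output reports 319 unique values across 997 observations, so by this rule bwt could not serve as an identifier.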


## Creating and Deleting Variables

In [11]:
%%stata

gen bwt_mg = bwt*1000
drop bwt_mg

gen bwt_kg = bwt/1000
label variable bwt_kg "birth weight in kilograms"
drop if bwt_kg < 2
// Can be used to remove outliers in data


. 
. gen bwt_mg = bwt*1000

. drop bwt_mg

. 
. gen bwt_kg = bwt/1000

. label variable bwt_kg "birth weight in kilograms"

. drop if bwt_kg < 2
(9 observations deleted)

. // Can be used to remove outliers in data
. 


## Dichotomise (or Stratify) Data

In [12]:
%%stata

gen bwt_strat = .  // Create the variable bwt_strat and initialize with missing values
replace bwt_strat = 0 if bwt_kg < 3  // Values in the low group become 0
replace bwt_strat = 1 if bwt_kg >= 3 & bwt_kg != .  // Values in the high group become 1


. 
. gen bwt_strat = .  // Create the variable bwt_strat and initialize with missi
> ng values
(988 missing values generated)

. replace bwt_strat = 0 if bwt_kg < 3  // Values in the low group become 0
(238 real changes made)

. replace bwt_strat = 1 if bwt_kg >= 3 & bwt_kg != .  // Values in the high gro
> up become 1
(750 real changes made)

. 
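
The "& bwt_kg != ." guard matters because Stata orders missing values above every real number, so "bwt_kg >= 3" alone is also true for missing observations. A plain-Python sketch of the difference, using math.inf as a stand-in for Stata's missing value (an assumption made only for this illustration):

```python
import math

# Stand-in for Stata's "." : it compares greater than any real number.
STATA_MISSING = math.inf


def naive_high(kg):
    """Mirrors 'replace ... = 1 if bwt_kg >= 3' with no missing check."""
    return 1 if kg >= 3 else 0


def safe_strat(kg):
    """Mirrors the guarded version: missing input stays missing (None)."""
    if kg == STATA_MISSING:
        return None
    return 1 if kg >= 3 else 0


for kg in (2.36, 3.6, STATA_MISSING):
    print(kg, naive_high(kg), safe_strat(kg))
```

Without the guard, a missing weight would silently land in the high group; with it, the observation keeps its missing value.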


## Recode Data into Multiple Groups
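Specify each rule in parentheses as (low/high = new value); "max" stands for the largest value of the variable<br>
Adjacent ranges here share a boundary (e.g. 2.5); a boundary value is assigned by the first rule that matches it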

In [13]:
%%stata

recode bwt_kg (0=0) (2/2.5=1) (2.5/3=2) (3/3.5=3) (3.5/max=4), gen(bwt_500g_splits)


. 
. recode bwt_kg (0=0) (2/2.5=1) (2.5/3=2) (3/3.5=3) (3.5/max=4), gen(bwt_500g_s
> plits)
(986 differences between bwt_kg and bwt_500g_splits)

. 
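
The rules above overlap at their boundaries (2.5, 3 and 3.5 each appear in two ranges). A plain-Python sketch of first-match-wins recoding, which is how Stata's recode is understood to resolve such overlaps; treat that reading, the RULES list, and the use of math.inf for "max" as assumptions of this illustration:

```python
import math

# (low, high, new_code) rules in the order they appear in the recode call above.
RULES = [(0, 0, 0), (2, 2.5, 1), (2.5, 3, 2), (3, 3.5, 3), (3.5, math.inf, 4)]


def recode(value, rules=RULES):
    """Return the code of the first rule whose closed range contains value."""
    for low, high, new_code in rules:
        if low <= value <= high:
            return new_code
    return value  # no rule matched: the value is left unchanged


for kg in (2.5, 2.51, 3.0, 4.2):
    print(kg, recode(kg))
```

Under this reading, 2.5 kg falls in group 1 and 3.0 kg in group 2, so each bin is closed on the right; checking a few boundary observations after recoding confirms the bins behave as intended.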


## Logic and Conditionals
Stata works with the usual relational and logical operators: ==; !=; >; <; >=; <=; & (AND); | (OR); ! (NOT)

<b>Examples of Built-in Conditionals:</b><br>
gen bwt_selected = inlist(VARIABLE, "2323", "2738", "3242") // inlist() function: 1 if VARIABLE equals any listed value, else 0 (omit the quotes for a numeric variable)<br>
count if bwt == 2311<br>
br if missing(VARIABLE)<br>
list VARIABLE if VARIABLE > n<br>
by VARIABLE2, sort: summ VARIABLE1 // where subgroups are defined by VARIABLE2 and a summary is produced for VARIABLE1 within each subgroup
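
For comparison, the same conditionals can be sketched in plain Python. The rows are the first five observations printed by head() earlier; the thresholds and listed values are arbitrary choices for illustration:

```python
from collections import defaultdict
from statistics import mean

# First five observations printed by head() (id, bwt, gest only).
rows = [
    {"id": 1, "bwt": 2100, "gest": 32},
    {"id": 2, "bwt": 3600, "gest": 33},
    {"id": 3, "bwt": 2360, "gest": 33},
    {"id": 4, "bwt": 2466, "gest": 33},
    {"id": 5, "bwt": 3840, "gest": 34},
]

# gen selected = inlist(bwt, 2100, 3840): 1 when bwt equals a listed value
selected = [int(r["bwt"] in (2100, 3840)) for r in rows]

# count if bwt > 3000
n_heavy = sum(r["bwt"] > 3000 for r in rows)

# by gest, sort: summ bwt  (only the group mean is reproduced here)
groups = defaultdict(list)
for r in rows:
    groups[r["gest"]].append(r["bwt"])
mean_bwt_by_gest = {g: mean(v) for g, v in sorted(groups.items())}

print(selected, n_heavy, mean_bwt_by_gest)
```

The grouping variable (gest) plays the role of VARIABLE2 and the summarised variable (bwt) the role of VARIABLE1.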

To keep track of work, begin a log with the log button (lab notebook icon) in the Stata window, or with the command "log using FILE.log"
