## Demographics and Employment in the United States

In the wake of the Great Recession of 2007-2009, employment statistics have drawn close attention as one of the most important metrics policymakers use to gauge the overall strength of the economy. In the United States, the government measures unemployment using the Current Population Survey (CPS), which collects demographic and employment information from a wide range of Americans each month. In this exercise, we will apply the topics reviewed in the lectures, along with a few new techniques, to the September 2013 version of this rich, nationally representative dataset (available online).

The observations in the dataset represent people surveyed in the September 2013 CPS who actually completed a survey. While the full dataset has 385 variables, in this exercise we will use a more compact version of the dataset, CPSData.csv, which has the following variables:

- PeopleInHousehold: The number of people in the interviewee's household.

- Region: The census region where the interviewee lives.

- State: The state where the interviewee lives.

- MetroAreaCode: A code that identifies the metropolitan area in which the interviewee lives (missing if the interviewee does not live in a metropolitan area). The mapping from codes to names of metropolitan areas is provided in the file MetroAreaCodes.csv.

- Age: The age, in years, of the interviewee. The two highest values are bins: 80 represents people aged 80-84, and 85 represents people aged 85 and older.

- Married: The marriage status of the interviewee.

- Sex: The sex of the interviewee.

- Education: The maximum level of education obtained by the interviewee.

- Race: The race of the interviewee.

- Hispanic: Whether the interviewee is of Hispanic ethnicity.

- CountryOfBirthCode: A code identifying the country of birth of the interviewee. The mapping from codes to names of countries is provided in the file CountryCodes.csv.

- Citizenship: The United States citizenship status of the interviewee.

- EmploymentStatus: The status of employment of the interviewee.

- Industry: The industry of employment of the interviewee (only available if they are employed).
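
Because the two highest Age values in the list above are bins rather than exact ages, it can help to label them explicitly when reporting results. The following is a minimal sketch under that coding; `age_label` is a hypothetical helper introduced here for illustration and is not part of the dataset or the exercise.

```r
# Hypothetical helper: turn a CPS Age code into a readable label.
# Assumes the coding described in the variable list: 80 means 80-84, 85 means 85 and older.
age_label <- function(age) {
  ifelse(age == 85, "85+",
         ifelse(age == 80, "80-84", as.character(age)))
}

# Ordinary ages pass through unchanged; the two binned codes get range labels.
age_label(c(21, 80, 85))
```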

In [1]:
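# stringsAsFactors = TRUE reproduces the Factor columns shown below;
# since R 4.0.0, read.csv() reads text columns as character by default.
CPS <- read.csv('dataset/CPSData.csv', stringsAsFactors = TRUE)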

In [2]:
str(CPS)

'data.frame':	131302 obs. of  14 variables:
 $ PeopleInHousehold : int  1 3 3 3 3 3 3 2 2 2 ...
 $ Region            : Factor w/ 4 levels "Midwest","Northeast",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ State             : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ MetroAreaCode     : int  26620 13820 13820 13820 26620 26620 26620 33660 33660 26620 ...
 $ Age               : int  85 21 37 18 52 24 26 71 43 52 ...
 $ Married           : Factor w/ 5 levels "Divorced","Married",..: 5 3 3 3 5 3 3 1 1 3 ...
 $ Sex               : Factor w/ 2 levels "Female","Male": 1 2 1 2 1 2 2 1 2 2 ...
 $ Education         : Factor w/ 8 levels "Associate degree",..: 1 4 4 6 1 2 4 4 4 2 ...
 $ Race              : Factor w/ 6 levels "American Indian",..: 6 3 3 3 6 6 6 6 6 6 ...
 $ Hispanic          : int  0 0 0 0 0 0 0 0 0 0 ...
 $ CountryOfBirthCode: int  57 57 57 57 57 57 57 57 57 57 ...
 $ Citizenship       : Factor w/ 3 levels "Citizen, Native",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ EmploymentSt

In [3]:
sort(table(CPS$Industry))


                               Armed forces 
                                         29 
                                     Mining 
                                        550 
Agriculture, forestry, fishing, and hunting 
                                       1307 
                                Information 
                                       1328 
                      Public administration 
                                       3186 
                             Other services 
                                       3224 
               Transportation and utilities 
                                       3260 
                                  Financial 
                                       4347 
                               Construction 
                                       4387 
                    Leisure and hospitality 
                                       6364 
                              Manufacturing 
                                       6791 
         

In [4]:
sort(table(CPS$State))


          New Mexico              Montana          Mississippi 
                1102                 1214                 1230 
             Alabama        West Virginia             Arkansas 
                1376                 1409                 1421 
           Louisiana                Idaho             Oklahoma 
                1450                 1518                 1523 
             Arizona               Alaska              Wyoming 
                1528                 1590                 1624 
        North Dakota       South Carolina            Tennessee 
                1645                 1658                 1784 
District of Columbia             Kentucky                 Utah 
                1791                 1841                 1842 
              Nevada              Vermont               Kansas 
                1856                 1890                 1935 
              Oregon             Nebraska        Massachusetts 
                1943                 19

In [5]:
table(CPS$Citizenship)


     Citizen, Native Citizen, Naturalized          Non-Citizen 
              116639                 7073                 7590 

In [6]:
# Proportion of interviewees who are U.S. citizens (native or naturalized)
(116639 + 7073) / (116639 + 7073 + 7590)
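
The same proportion can be derived from the counts without retyping each number into an arithmetic expression. This is a minimal sketch that rebuilds the counts printed by `table(CPS$Citizenship)` as a named vector; the variable names `citizenship`, `shares`, and `citizen_share` are illustrative only.

```r
# Counts copied from the table(CPS$Citizenship) output above.
citizenship <- c("Citizen, Native" = 116639,
                 "Citizen, Naturalized" = 7073,
                 "Non-Citizen" = 7590)

# Share of each category, then the combined share of the two citizen categories.
shares <- citizenship / sum(citizenship)
citizen_share <- sum(shares[c("Citizen, Native", "Citizen, Naturalized")])
round(citizen_share, 4)
```

With the full data frame loaded, `mean(CPS$Citizenship != "Non-Citizen")` should give the same proportion in one step, assuming Citizenship has no missing values.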

In [7]:
# Keep only interviewees of Hispanic ethnicity
hispanic <- subset(CPS, Hispanic == 1)

In [8]:
table(hispanic$Race)


 American Indian            Asian            Black      Multiracial 
             304              113              621              448 
Pacific Islander            White 
              77            16731 

In [9]:
table(CPS$Race, CPS$Hispanic)

                  
                       0     1
  American Indian   1129   304
  Asian             6407   113
  Black            13292   621
  Multiracial       2449   448
  Pacific Islander   541    77
  White            89190 16731
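
Raw counts make it hard to compare races of very different sizes. As a sketch, the cross-tabulation above can be converted to row proportions with `prop.table()`; the matrix below simply re-enters the printed counts, and the name `race_by_hispanic` is an illustrative choice.

```r
# Counts copied from the table(CPS$Race, CPS$Hispanic) output above.
race_by_hispanic <- matrix(
  c( 1129,   304,
     6407,   113,
    13292,   621,
     2449,   448,
      541,    77,
    89190, 16731),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Race = c("American Indian", "Asian", "Black",
             "Multiracial", "Pacific Islander", "White"),
    Hispanic = c("0", "1")
  )
)

# margin = 1 divides each row by its row total, so every row sums to 1.
hispanic_share <- prop.table(race_by_hispanic, margin = 1)

# Share of each race's interviewees who are of Hispanic ethnicity.
round(hispanic_share[, "1"], 3)
```

On the real data, `prop.table(table(CPS$Race, CPS$Hispanic), margin = 1)` performs the same conversion directly.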

In [10]:
summary(CPS)

 PeopleInHousehold       Region               State       MetroAreaCode  
 Min.   : 1.000    Midwest  :30684   California  :11570   Min.   :10420  
 1st Qu.: 2.000    Northeast:25939   Texas       : 7077   1st Qu.:21780  
 Median : 3.000    South    :41502   New York    : 5595   Median :34740  
 Mean   : 3.284    West     :33177   Florida     : 5149   Mean   :35075  
 3rd Qu.: 4.000                      Pennsylvania: 3930   3rd Qu.:41860  
 Max.   :15.000                      Illinois    : 3912   Max.   :79600  
                                     (Other)     :94069   NA's   :34238  
      Age                 Married          Sex       
 Min.   : 0.00   Divorced     :11151   Female:67481  
 1st Qu.:19.00   Married      :55509   Male  :63821  
 Median :39.00   Never Married:30772                 
 Mean   :38.83   Separated    : 2027                 
 3rd Qu.:57.00   Widowed      : 6505                 
 Max.   :85.00   NA's         :25338                 
                              

In [11]:
table(CPS$Region, is.na(CPS$Married))

           
            FALSE  TRUE
  Midwest   24609  6075
  Northeast 21432  4507
  South     33535  7967
  West      26388  6789

In [12]:
table(CPS$Sex, is.na(CPS$Married))

        
         FALSE  TRUE
  Female 55264 12217
  Male   50700 13121

In [13]:
table(CPS$Age, is.na(CPS$Married))

    
     FALSE TRUE
  0      0 1283
  1      0 1559
  2      0 1574
  3      0 1693
  4      0 1695
  5      0 1795
  6      0 1721
  7      0 1681
  8      0 1729
  9      0 1748
  10     0 1750
  11     0 1721
  12     0 1797
  13     0 1802
  14     0 1790
  15  1795    0
  16  1751    0
  17  1764    0
  18  1596    0
  19  1517    0
  20  1398    0
  21  1525    0
  22  1536    0
  23  1638    0
  24  1627    0
  25  1604    0
  26  1643    0
  27  1657    0
  28  1736    0
  29  1645    0
  30  1854    0
  31  1762    0
  32  1790    0
  33  1804    0
  34  1653    0
  35  1716    0
  36  1663    0
  37  1531    0
  38  1530    0
  39  1542    0
  40  1571    0
  41  1673    0
  42  1711    0
  43  1819    0
  44  1764    0
  45  1749    0
  46  1665    0
  47  1647    0
  48  1791    0
  49  1989    0
  50  1966    0
  51  1931    0
  52  1935    0
  53  1994    0
  54  1912    0
  55  1895    0
  56  1935    0
  57  1827    0
  58  1874    0
  59  1758    0
  60  1746    0
  6
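
In the rows shown above, Married is missing for every interviewee aged 0 to 14 and for none aged 15 to 60. A more compact way to check a pattern like this is to tabulate a single age indicator against missingness instead of every individual age. The sketch below uses a small made-up data frame (`toy`) rather than the CPS data.

```r
# Toy data (made up for illustration): Married is NA exactly when Age < 15.
toy <- data.frame(
  Age = c(3, 14, 15, 42, 9, 60),
  Married = c(NA, NA, "Never Married", "Married", NA, "Widowed")
)

# Two-by-two table: is the interviewee under 15, and is Married missing?
age_vs_missing <- table(Under15 = toy$Age < 15,
                        MarriedMissing = is.na(toy$Married))
age_vs_missing
```

If the off-diagonal cells are zero, as they are here, missingness is fully explained by the age cutoff. The equivalent call on the real data would be `table(CPS$Age < 15, is.na(CPS$Married))`.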

In [14]:
table(CPS$State, is.na(CPS$MetroAreaCode))

                      
                       FALSE  TRUE
  Alabama               1020   356
  Alaska                   0  1590
  Arizona               1327   201
  Arkansas               724   697
  California           11333   237
  Colorado              2545   380
  Connecticut           2593   243
  Delaware              1696   518
  District of Columbia  1791     0
  Florida               4947   202
  Georgia               2250   557
  Hawaii                1576   523
  Idaho                  761   757
  Illinois              3473   439
  Indiana               1420   584
  Iowa                  1297  1231
  Kansas                1234   701
  Kentucky               908   933
  Louisiana             1216   234
  Maine                  909  1354
  Maryland              2978   222
  Massachusetts         1858   129
  Michigan              2517   546
  Minnesota             2150   989
  Mississippi            376   854
  Missouri              1440   705
  Montana                199  10

In [15]:
table(CPS$Region, is.na(CPS$MetroAreaCode))

           
            FALSE  TRUE
  Midwest   20010 10674
  Northeast 20330  5609
  South     31631  9871
  West      25093  8084

In [16]:
# Proportion of interviewees in each state who live outside a metropolitan area, smallest first
sort(tapply(is.na(CPS$MetroAreaCode), CPS$State, mean))
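
This call works because the mean of a logical vector is the proportion of `TRUE` values, and `tapply()` computes that mean separately for each group. The sketch below shows the mechanism on made-up vectors; `toy_state` and `toy_code` are illustrative names and values, not CPS data.

```r
# Toy data (made up): two interviewees in Alaska, four in Ohio.
toy_state <- c("Alaska", "Alaska", "Ohio", "Ohio", "Ohio", "Ohio")
toy_code  <- c(NA, NA, 10420, 10420, NA, 10420)

# is.na() gives TRUE/FALSE; mean() of a logical vector is the share of TRUE.
non_metro_share <- tapply(is.na(toy_code), toy_state, mean)
sort(non_metro_share)
```

Here every toy Alaska row has a missing code, mirroring the real table above in which Alaska has no interviewees with a metropolitan area code.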

In [17]:
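# Lookup tables mapping codes to names; stringsAsFactors = TRUE matches the Factor columns shown below
MetroAreaCode <- read.csv('dataset/MetroAreaCodes.csv', stringsAsFactors = TRUE)
CountryCode <- read.csv('dataset/CountryCodes.csv', stringsAsFactors = TRUE)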

In [18]:
str(MetroAreaCode)

'data.frame':	271 obs. of  2 variables:
 $ Code     : int  460 3000 3160 3610 3720 6450 10420 10500 10580 10740 ...
 $ MetroArea: Factor w/ 271 levels "Akron, OH","Albany-Schenectady-Troy, NY",..: 12 92 97 117 122 195 1 3 2 4 ...


In [19]:
str(CountryCode)

'data.frame':	149 obs. of  2 variables:
 $ Code   : int  57 66 73 78 96 100 102 103 104 105 ...
 $ Country: Factor w/ 149 levels "Afghanistan",..: 139 57 105 135 97 3 11 18 24 37 ...


In [20]:
# Left join: all.x=TRUE keeps every CPS row, including those with a missing MetroAreaCode
CPS <- merge(CPS, MetroAreaCode, by.x="MetroAreaCode", by.y="Code", all.x=TRUE)
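
The effect of `all.x=TRUE` is easiest to see on a small example. The sketch below uses made-up data frames (`toy_people`, `toy_codes`); the code 10420 and its name "Akron, OH" follow the lookup table shown above, while 99999 and "Example Metro" are invented for illustration.

```r
# Toy interviewees: one has no metropolitan area code.
toy_people <- data.frame(MetroAreaCode = c(10420, NA, 99999, 10420),
                         Age = c(40, 85, 21, 63))

# Toy lookup table from code to name.
toy_codes <- data.frame(Code = c(10420, 99999),
                        MetroArea = c("Akron, OH", "Example Metro"))

# Left join: every row of toy_people is kept; unmatched rows get NA for MetroArea.
merged <- merge(toy_people, toy_codes,
                by.x = "MetroAreaCode", by.y = "Code", all.x = TRUE)

# Default (inner) join: the row with the missing code is dropped.
inner <- merge(toy_people, toy_codes,
               by.x = "MetroAreaCode", by.y = "Code")

nrow(merged)
nrow(inner)
```

By default `merge()` also places the key column first and sorts the result by it, which is why MetroAreaCode becomes the first column and the row order changes after the merge.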

In [21]:
summary(CPS)

 MetroAreaCode   PeopleInHousehold       Region               State      
 Min.   :10420   Min.   : 1.000    Midwest  :30684   California  :11570  
 1st Qu.:21780   1st Qu.: 2.000    Northeast:25939   Texas       : 7077  
 Median :34740   Median : 3.000    South    :41502   New York    : 5595  
 Mean   :35075   Mean   : 3.284    West     :33177   Florida     : 5149  
 3rd Qu.:41860   3rd Qu.: 4.000                      Pennsylvania: 3930  
 Max.   :79600   Max.   :15.000                      Illinois    : 3912  
 NA's   :34238                                       (Other)     :94069  
      Age                 Married          Sex       
 Min.   : 0.00   Divorced     :11151   Female:67481  
 1st Qu.:19.00   Married      :55509   Male  :63821  
 Median :39.00   Never Married:30772                 
 Mean   :38.83   Separated    : 2027                 
 3rd Qu.:57.00   Widowed      : 6505                 
 Max.   :85.00   NA's         :25338                 
                              

In [22]:
str(CPS)

'data.frame':	131302 obs. of  15 variables:
 $ MetroAreaCode     : int  10420 10420 10420 10420 10420 10420 10420 10420 10420 10420 ...
 $ PeopleInHousehold : int  4 4 2 4 1 3 4 4 2 3 ...
 $ Region            : Factor w/ 4 levels "Midwest","Northeast",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ State             : Factor w/ 51 levels "Alabama","Alaska",..: 36 36 36 36 36 36 36 36 36 36 ...
 $ Age               : int  2 9 73 40 63 19 30 6 60 32 ...
 $ Married           : Factor w/ 5 levels "Divorced","Married",..: NA NA 2 2 3 3 2 NA 2 2 ...
 $ Sex               : Factor w/ 2 levels "Female","Male": 2 2 1 1 2 1 1 1 1 2 ...
 $ Education         : Factor w/ 8 levels "Associate degree",..: NA NA 8 4 6 4 2 NA 4 4 ...
 $ Race              : Factor w/ 6 levels "American Indian",..: 6 6 6 6 6 6 2 6 6 6 ...
 $ Hispanic          : int  0 0 0 0 0 0 0 1 0 0 ...
 $ CountryOfBirthCode: int  57 57 57 362 57 57 203 57 57 57 ...
 $ Citizenship       : Factor w/ 3 levels "Citizen, Native",..: 1 1 1 2 1 1 3 1 1 1 ...


In [23]:
sort(table(CPS$MetroArea))


                       Appleton-Oshkosh-Neenah, WI 
                                                 0 
                 Grand Rapids-Muskegon-Holland, MI 
                                                 0 
               Greenville-Spartanburg-Anderson, SC 
                                                 0 
                       Hinesville-Fort Stewart, GA 
                                                 0 
                                     Jamestown, NY 
                                                 0 
                        Kalamazoo-Battle Creek, MI 
                                                 0 
                       Portsmouth-Rochester, NH-ME 
                                                 0 
                                 Bowling Green, KY 
                                                29 
                                    Ocean City, NJ 
                                                30 
                                   Springfield, OH 
           

In [24]:
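# Proportion of interviewees of Hispanic ethnicity in each metropolitan area
sort(tapply(CPS$Hispanic, CPS$MetroArea, mean))

In [25]:
# Proportion of Asian interviewees in each metropolitan area
sort(tapply(CPS$Race == 'Asian', CPS$MetroArea, mean))

In [26]:
# Proportion without a high school diploma; na.rm=TRUE ignores interviewees whose Education is missing
sort(tapply(CPS$Education == "No high school diploma", CPS$MetroArea, mean, na.rm=TRUE))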
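
The `na.rm=TRUE` argument matters here because comparing `NA` with a string yields `NA`, and a single `NA` makes the mean of the whole group `NA`. The sketch below demonstrates this on made-up vectors; `toy_area` and `toy_education` are illustrative names and values only.

```r
# Toy data (made up): area "A" contains one missing Education value.
toy_area <- c("A", "A", "A", "B", "B")
toy_education <- c("No high school diploma", NA, "Bachelor's degree",
                   "No high school diploma", "No high school diploma")

# The comparison propagates the missing value: TRUE, NA, FALSE, TRUE, TRUE.
no_diploma <- toy_education == "No high school diploma"

# Without na.rm the mean for area "A" is NA; with na.rm the NA is dropped first.
with_na    <- tapply(no_diploma, toy_area, mean)
without_na <- tapply(no_diploma, toy_area, mean, na.rm = TRUE)

with_na
without_na
```

Extra arguments to `tapply()` are passed on to the summary function, which is how `na.rm = TRUE` reaches `mean()`.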

In [40]:
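# Reload the data and redo the metropolitan area merge so this cell starts from a clean copy
CPS <- read.csv('dataset/CPSData.csv', stringsAsFactors = TRUE)
CPS <- merge(CPS, MetroAreaCode, by.x="MetroAreaCode", by.y="Code", all.x=TRUE)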
str(CPS)

'data.frame':	131302 obs. of  15 variables:
 $ MetroAreaCode     : int  10420 10420 10420 10420 10420 10420 10420 10420 10420 10420 ...
 $ PeopleInHousehold : int  4 4 2 4 1 3 4 4 2 3 ...
 $ Region            : Factor w/ 4 levels "Midwest","Northeast",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ State             : Factor w/ 51 levels "Alabama","Alaska",..: 36 36 36 36 36 36 36 36 36 36 ...
 $ Age               : int  2 9 73 40 63 19 30 6 60 32 ...
 $ Married           : Factor w/ 5 levels "Divorced","Married",..: NA NA 2 2 3 3 2 NA 2 2 ...
 $ Sex               : Factor w/ 2 levels "Female","Male": 2 2 1 1 2 1 1 1 1 2 ...
 $ Education         : Factor w/ 8 levels "Associate degree",..: NA NA 8 4 6 4 2 NA 4 4 ...
 $ Race              : Factor w/ 6 levels "American Indian",..: 6 6 6 6 6 6 2 6 6 6 ...
 $ Hispanic          : int  0 0 0 0 0 0 0 1 0 0 ...
 $ CountryOfBirthCode: int  57 57 57 362 57 57 203 57 57 57 ...
 $ Citizenship       : Factor w/ 3 levels "Citizen, Native",..: 1 1 1 2 1 1 3 1 1 1 ...


In [41]:
# Left join on country of birth: adds a Country column while keeping every CPS row
CPS <- merge(CPS, CountryCode, by.x="CountryOfBirthCode", by.y="Code", all.x=TRUE)
str(CPS)

'data.frame':	131302 obs. of  16 variables:
 $ CountryOfBirthCode: int  57 57 57 57 57 57 57 57 57 57 ...
 $ MetroAreaCode     : int  10420 71650 10420 10420 10420 10420 10420 10420 10420 10420 ...
 $ PeopleInHousehold : int  2 4 5 2 2 3 1 3 4 4 ...
 $ Region            : Factor w/ 4 levels "Midwest","Northeast",..: 1 2 1 1 1 1 1 1 1 1 ...
 $ State             : Factor w/ 51 levels "Alabama","Alaska",..: 36 30 36 36 36 36 36 36 36 36 ...
 $ Age               : int  73 5 10 30 30 0 34 32 6 9 ...
 $ Married           : Factor w/ 5 levels "Divorced","Married",..: 2 NA NA 2 2 NA 1 2 NA NA ...
 $ Sex               : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 2 2 2 1 2 ...
 $ Education         : Factor w/ 8 levels "Associate degree",..: 8 NA NA 1 2 NA 4 4 NA NA ...
 $ Race              : Factor w/ 6 levels "American Indian",..: 6 6 6 6 6 6 6 6 6 6 ...
 $ Hispanic          : int  0 0 0 0 0 0 0 0 1 0 ...
 $ Citizenship       : Factor w/ 3 levels "Citizen, Native",..: 1 1 1 1 1 1 1 1 1 1 ...

In [42]:
summary(CPS)

 CountryOfBirthCode MetroAreaCode   PeopleInHousehold       Region     
 Min.   : 57.00     Min.   :10420   Min.   : 1.000    Midwest  :30684  
 1st Qu.: 57.00     1st Qu.:21780   1st Qu.: 2.000    Northeast:25939  
 Median : 57.00     Median :34740   Median : 3.000    South    :41502  
 Mean   : 82.68     Mean   :35075   Mean   : 3.284    West     :33177  
 3rd Qu.: 57.00     3rd Qu.:41860   3rd Qu.: 4.000                     
 Max.   :555.00     Max.   :79600   Max.   :15.000                     
                    NA's   :34238                                      
          State            Age                 Married          Sex       
 California  :11570   Min.   : 0.00   Divorced     :11151   Female:67481  
 Texas       : 7077   1st Qu.:19.00   Married      :55509   Male  :63821  
 New York    : 5595   Median :39.00   Never Married:30772                 
 Florida     : 5149   Mean   :38.83   Separated    : 2027                 
 Pennsylvania: 3930   3rd Qu.:57.00   Widowed    

In [49]:
sort(table(CPS$Country), decreasing=TRUE)


                 United States                         Mexico 
                        115063                           3921 
                   Philippines                          India 
                           839                            770 
                         China                    Puerto Rico 
                           581                            518 
                   El Salvador                        Vietnam 
                           477                            458 
                       Germany                           Cuba 
                           438                            426 
                        Canada                          Korea 
                           410                            334 
            Dominican Republic                      Guatemala 
                           330                            309 
                       Jamaica                       Columbia 
                           217                        

In [50]:
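# Rows: lives in the New York metro area; columns: born outside the United States.
# table() omits interviewees for whom either value is NA.
table(CPS$MetroArea == "New York-Northern New Jersey-Long Island, NY-NJ-PA", CPS$Country != "United States")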

       
        FALSE  TRUE
  FALSE 78757 12744
  TRUE   3736  1668

In [51]:
# Proportion of New York metro area interviewees born outside the United States
1668 / (3736 + 1668)
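
As a cross-check, the same share can be computed from the table cells by name, and the cell total shows how many interviewees the table actually covers. This sketch re-enters the printed counts; `ny_by_foreign` and the dimension names are illustrative choices.

```r
# Counts copied from the table output above.
ny_by_foreign <- matrix(c(78757, 12744,
                           3736,  1668),
                        ncol = 2, byrow = TRUE,
                        dimnames = list(InNewYorkMetro = c("FALSE", "TRUE"),
                                        BornOutsideUS  = c("FALSE", "TRUE")))

# Share of New York metro interviewees born outside the United States.
ny_foreign_share <- ny_by_foreign["TRUE", "TRUE"] / sum(ny_by_foreign["TRUE", ])
round(ny_foreign_share, 3)

# Total interviewees counted in the table.
rows_counted <- sum(ny_by_foreign)
rows_counted
```

The total is smaller than the 131302 observations in the data frame because `table()` leaves out interviewees whose metropolitan area or country of birth is missing, so the share is computed only over those with both values present.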