# Export Capstone Project: Data Cleaning and Preparation
#### Josh Barker
#### DTSC 691: Data Science Capstone: Applied Data Science
#### Eastern University
#### Spring 2025

## Data Cleaning and Preparation Notebook Overview
We have already pulled together data from a variety of sources into a single dataframe. Now we need to prepare it for training our machine learning models.

In this notebook, we will:
* Import the `metro_exports_joined.csv` file created from merged dataframes in the Data Import and Merge notebook.
* Systematically analyze values in each variable, replacing null placeholders with null, removing text and non-numerical characters from variables that should include only numbers, and removing irrelevant text.
* Identify null values and either fill them or drop the affected rows.
* Conduct feature engineering.

### Import Necessary Libraries
Below are all the libraries needed to run this notebook:

In [1]:
import numpy as np
import pandas as pd

## Import Data
Below, we will import the final merged dataframe from the last notebook and take a look at some of the summary statistics.

In [2]:
metro_exports_all = pd.read_csv("metro_exports_joined.csv")

In [3]:
metro_exports_all.describe()

| | Year | exports | Per_Capita_Income | Personal_Income | Population | FHFA_index_Q1 | FHFA_index_Q2 | FHFA_index_Q3 | FHFA_index_Q4 | Employment | ... | Policy_Anti-Price_Gouging | Minimum_Wage | USD_to_Euro | USD_to_Pound | USD_to_Peso | USD_to_Yuan | USD_to_Yen | S&P500_Average | S&P500_Close | DJIA_close |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7227.0 | 7227.0 | 6990.0 | 6990.0 | 6990.0 | 6964.0 | 6964.0 | 6964.0 | 6962.0 | 6914.0 | ... | 6677.0 | 7186.0 | 7227.0 | 7227.0 | 7227.0 | 7227.0 | 7227.0 | 7227.0 | 7227.0 | 7227.0 |
| mean | 2014.091739 | 3875634000.0 | 43889.905436 | 37552770.0 | 734272.2 | 202.813287 | 206.065566 | 208.695803 | 210.033864 | 347723.8 | ... | 0.833009 | 7.777581 | 0.81816 | 0.672866 | 15.495438 | 6.824843 | 107.582682 | 2229.019188 | 2332.020224 | 20055.573772 |
| std | 5.459234 | 13189290000.0 | 12718.371514 | 100212900.0 | 1629250.0 | 60.28652 | 64.307191 | 66.334523 | 66.933069 | 778878.1 | ... | 0.366923 | 1.912532 | 0.078972 | 0.095538 | 3.67354 | 0.545271 | 14.951963 | 1091.079358 | 1196.823599 | 9107.245223 |
| min | 2005.0 | 2475126.0 | 18066.0 | 1367137.0 | 54600.0 | 100.22 | 100.79 | 101.42 | 101.33 | 17205.0 | ... | 0.0 | 2.65 | 0.683075 | 0.499806 | 10.892538 | 6.152292 | 79.829741 | 969.183333 | 825.9 | 8776.39 |
| 25% | 2009.0 | 216536000.0 | 35276.5 | 5641341.0 | 147274.0 | 161.3075 | 162.0175 | 163.1375 | 163.745 | 65302.5 | ... | 1.0 | 7.25 | 0.753045 | 0.607353 | 12.429192 | 6.451678 | 97.589811 | 1282.941667 | 1312.4 | 12217.56 |
| 50% | 2014.0 | 610679200.0 | 41361.5 | 10643980.0 | 252090.0 | 184.57 | 185.995 | 187.435 | 188.495 | 117423.0 | ... | 1.0 | 7.25 | 0.803857 | 0.647491 | 13.502932 | 6.756806 | 109.007953 | 1962.114167 | 1940.24 | 17425.03 |
| 75% | 2019.0 | 2054501000.0 | 49742.75 | 25812100.0 | 597459.8 | 227.2975 | 231.465 | 234.8775 | 236.9 | 270825.0 | ... | 1.0 | 8.1 | 0.892882 | 0.776691 | 19.227887 | 6.952764 | 116.321732 | 2981.413333 | 3225.52 | 28538.44 |
| max | 2023.0 | 233550200000.0 | 148036.0 | 1788676000.0 | 20133110.0 | 565.23 | 573.28 | 569.17 | 567.44 | 9653313.0 | ... | 1.0 | 15.74 | 0.951098 | 0.811347 | 21.46617 | 8.19495 | 140.510745 | 4386.741667 | 4845.65 | 37689.54 |


In [4]:
metro_exports_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7227 entries, 0 to 7226
Data columns (total 65 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   MSA                            7227 non-null   object 
 1   Year                           7227 non-null   int64  
 2   exports                        7227 non-null   float64
 3   Per_Capita_Income              6990 non-null   float64
 4   Personal_Income                6990 non-null   float64
 5   Population                     6990 non-null   float64
 6   FHFA_index_Q1                  6964 non-null   float64
 7   FHFA_index_Q2                  6964 non-null   float64
 8   FHFA_index_Q3                  6964 non-null   float64
 9   FHFA_index_Q4                  6962 non-null   float64
 10  Employment                     6914 non-null   float64
 11  Labor_Force                    6914 non-null   float64
 12  Unemployment_Rate              6914 non-null   f

### Verify Nulls
Let's start by verifying that the null values that we see are not due to issues in joining tables. Some rows would be expected to have null values, due to missing data, for instance. 

First, we can see from the summary above that all our MSAs, years, exports, and nationwide variables (exchange rates, S&P, and Dow) have no missing values.

Next, we see our BEA data has 237 null rows. What makes up these rows?

In [5]:
metro_exports_all[metro_exports_all["Personal_Income"].isnull()]["MSA"].unique()

array(['Aguadilla, PR', 'Anderson, IN', 'Anderson, SC', 'Arecibo, PR',
       'Danville, VA', 'Fajardo, PR', 'Guayama, PR',
       'Holland-Grand Haven, MI', 'Mayaguez, PR', 'Palm Coast, FL',
       'Pascagoula, MS', 'Ponce, PR', 'San German, PR',
       'San Juan-Bayamon-Caguas, PR', 'Sandusky, OH', 'Yauco, PR',
       'Z-Non Metropolitan Areas', 'Z-Other Metropolitan Areas',
       'Z-Unknown'], dtype=object)

Of the 19 MSA values without BEA variables, 3 are the ITA's miscellaneous areas, all beginning with `Z-`, which do not correspond to an actual MSA or to an area for which the BEA estimates these variables. 9 are MSAs in Puerto Rico, for which the BEA does not maintain records. The remaining 7 metro areas are too small for the BEA to collect metro-area-level metrics.

We know that the 3 miscellaneous areas and Puerto Rico will not have data for most of the sources, so we will drop these rows.

In [6]:
metro_exports_all = metro_exports_all[~metro_exports_all["MSA"].str.contains("Z-")]
metro_exports_all = metro_exports_all[~metro_exports_all["MSA"].str.contains(", PR")]
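
The breakdown of MSAs without BEA data can be double-checked programmatically. The following is a minimal sketch on a made-up toy frame; the helper `classify_msa` and the sample rows are hypothetical and only illustrate the idea of labelling each MSA by its name pattern.

```python
import pandas as pd

# Hypothetical toy frame standing in for metro_exports_all.
toy = pd.DataFrame({
    "MSA": ["Ponce, PR", "Z-Unknown", "Sandusky, OH", "Akron, OH", "Yauco, PR"],
    "Personal_Income": [None, None, None, 1.0e7, None],
})

def classify_msa(name: str) -> str:
    """Label an MSA as an ITA placeholder, a Puerto Rico metro, or other."""
    if name.startswith("Z-"):
        return "ITA placeholder"
    if name.endswith(", PR"):
        return "Puerto Rico"
    return "other"

# Unique MSAs with no BEA value, counted by category.
missing = toy.loc[toy["Personal_Income"].isnull(), "MSA"].drop_duplicates()
breakdown = missing.map(classify_msa).value_counts()
print(breakdown)
```

Matching on the start and end of the name is slightly stricter than `str.contains`, which would also match the pattern in the middle of a name.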

Now, let's review the new row counts and null values.

In [7]:
metro_exports_all.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7043 entries, 0 to 7185
Data columns (total 65 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   MSA                            7043 non-null   object 
 1   Year                           7043 non-null   int64  
 2   exports                        7043 non-null   float64
 3   Per_Capita_Income              6990 non-null   float64
 4   Personal_Income                6990 non-null   float64
 5   Population                     6990 non-null   float64
 6   FHFA_index_Q1                  6964 non-null   float64
 7   FHFA_index_Q2                  6964 non-null   float64
 8   FHFA_index_Q3                  6964 non-null   float64
 9   FHFA_index_Q4                  6962 non-null   float64
 10  Employment                     6816 non-null   float64
 11  Labor_Force                    6816 non-null   float64
 12  Unemployment_Rate              6816 non-null   float6

In [8]:
metro_exports_all[metro_exports_all["Per_Capita_Income"].isnull()]["MSA"].unique()

array(['Anderson, IN', 'Anderson, SC', 'Danville, VA',
       'Holland-Grand Haven, MI', 'Palm Coast, FL', 'Pascagoula, MS',
       'Sandusky, OH'], dtype=object)

In examining the remaining 7 MSAs lacking BEA data, we can see that the BEA does not produce statistics for these areas, so these gaps are genuine rather than a side effect of the join.

Now, turning to the FHFA data, we can see that we're missing 79 values in Q1 and 81 in Q4. In terms of unique MSAs, 8 are missing in Q1 and 10 in Q4. Other than Sandusky, OH, which the FHFA discontinued in 2013, the other 7 in Q1 were never collected by the FHFA. The 2 MSAs that show up in Q4 but not in Q1 are each missing only a single value: 2022 Q4.

In [9]:
metro_exports_all[metro_exports_all["FHFA_index_Q1"].isnull()]["MSA"].unique()

array(['Beckley, WV', 'California-Lexington Park, MD', 'Chambersburg, PA',
       'Homosassa Springs, FL', 'Kahului-Wailuku, HI', 'Sandusky, OH',
       'Twin Falls, ID', 'Urban Honolulu, HI'], dtype=object)

In [10]:
metro_exports_all[metro_exports_all["FHFA_index_Q4"].isnull()]["MSA"].unique()

array(['Beckley, WV', 'California-Lexington Park, MD', 'Chambersburg, PA',
       'Homosassa Springs, FL', 'Ithaca, NY', 'Kahului-Wailuku, HI',
       'Pine Bluff, AR', 'Sandusky, OH', 'Twin Falls, ID',
       'Urban Honolulu, HI'], dtype=object)

In [11]:
# Combine both conditions into one mask so pandas does not have to reindex a second boolean filter.
fhfa_q4_gap = metro_exports_all["MSA"].isin(["Pine Bluff, AR", "Ithaca, NY"]) & metro_exports_all["FHFA_index_Q4"].isnull()
metro_exports_all.loc[fhfa_q4_gap, "Year"].unique()

array([2022])

We can fill in each of these single missing values with the value from the preceding quarter (Q3). Rows where Q3 is also missing remain null.

In [12]:
#metro_exports_all[metro_exports_all["MSA"] == "Pine Bluff, AR"][metro_exports_all["Year"] == 2022]["FHFA_index_Q4"] = (metro_exports_all[metro_exports_all["MSA"] == "Pine Bluff, AR"][metro_exports_all["Year"] == 2021].iloc[:,9:10].sum(axis = 1).values + metro_exports_all[metro_exports_all["MSA"] == "Pine Bluff, AR"][metro_exports_all["Year"] == 2022].iloc[:,6:9].sum(axis = 1).values)/4

In [13]:
metro_exports_all["FHFA_index_Q4"] = metro_exports_all["FHFA_index_Q4"].fillna(metro_exports_all["FHFA_index_Q3"])
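
To illustrate how filling Q4 from Q3 behaves, here is a small sketch on made-up index values (the numbers are not from the dataset). It shows that a row missing only Q4 picks up its Q3 value, while a row missing both quarters stays null.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "MSA": ["Ithaca, NY", "Sandusky, OH", "Akron, OH"],
    "FHFA_index_Q3": [250.0, np.nan, 180.0],
    "FHFA_index_Q4": [np.nan, np.nan, 182.0],
})

# fillna with a Series aligns on the row index, so each row uses its own Q3.
toy["FHFA_index_Q4"] = toy["FHFA_index_Q4"].fillna(toy["FHFA_index_Q3"])
print(toy)
```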

We can see that we have 227 null values for our BLS data. We will fill the null values for `Unemployment_Rate` with the national average unemployment rate for that year. It would be trickier to fill the other three BLS columns, which are mathematically related to the region's population.

In [14]:
metro_exports_all["Employment"].isnull().sum()

np.int64(227)

In [15]:
bls_null = metro_exports_all.groupby("MSA")["Employment"].agg(lambda x: x.isnull().sum())
bls_null[bls_null > 0]

MSA
Anderson, IN                             8
Anderson, SC                             8
Bloomsburg-Berwick, PA                  10
California-Lexington Park, MD           10
Carbondale-Marion, IL                   10
Cumberland, MD-WV                       18
Danville, IL                            18
Danville, VA                             8
East Stroudsburg, PA                    10
Holland-Grand Haven, MI                  8
Louisville/Jefferson County, KY-IN      19
Madera, CA                              18
New Bern, NC                            10
New Orleans-Metairie, LA                 2
Ocean City, NJ                          17
Palm Coast, FL                           7
Pascagoula, MS                           5
Pine Bluff, AR                          18
Poughkeepsie-Newburgh-Middletown, NY    13
The Villages, FL                        10
Name: Employment, dtype: int64

In [16]:
bls_national_unemployment = pd.read_excel("data/bls_national_unemployment.xlsx", skiprows = list(range(0, 11)))
bls_national_unemployment = bls_national_unemployment.drop("Annual", axis = 1)
bls_national_unemployment = bls_national_unemployment.set_index("Year")
bls_national_unemployment["Average"] = round(bls_national_unemployment.mean(axis = 1), 1)
bls_avg_ue = bls_national_unemployment["Average"]
bls_avg_ue.head()

Year
2000    4.0
2001    4.7
2002    5.8
2003    6.0
2004    5.5
Name: Average, dtype: float64

In [17]:
metro_exports_all = metro_exports_all.set_index("Year")
metro_exports_all["Unemployment_Rate"] = metro_exports_all["Unemployment_Rate"].fillna(bls_avg_ue)
metro_exports_all = metro_exports_all.reset_index()
metro_exports_all = metro_exports_all.rename_axis(None, axis=1)
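
An alternative that avoids changing the index is to map each row's `Year` to the national rate and use that as the fill value. This is only a sketch on a toy frame; the `national_rate` series and the MSA labels are illustrative stand-ins for `bls_avg_ue` and the real data.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for bls_avg_ue: national average rate indexed by year.
national_rate = pd.Series({2005: 5.1, 2006: 4.6}, name="Average")

toy = pd.DataFrame({
    "MSA": ["A", "A", "B", "B"],
    "Year": [2005, 2006, 2005, 2006],
    "Unemployment_Rate": [4.0, np.nan, np.nan, 6.2],
})

# Look up the national rate for each row's year, then fill only the gaps.
fill_values = toy["Year"].map(national_rate)
toy["Unemployment_Rate"] = toy["Unemployment_Rate"].fillna(fill_values)
print(toy)
```

Because the column order is never touched, this variant does not interact with any later positional (`iloc`) selection.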

Manufacturing Employment is another column whose missing values are difficult to fill. However, because so many values are missing (some because there is too little manufacturing employment to report, and some because the BLS does not report manufacturing employment data for those areas), we hesitate to drop all rows without it. Instead, we will conduct correlation analysis with and without it before deciding how to proceed.

In [18]:
metro_exports_all[metro_exports_all["Manufacturing_Employment"].isnull()]["MSA"].unique()

array(['Albany, GA', 'Alexandria, LA', 'Ames, IA', 'Anderson, IN',
       'Anderson, SC', 'Athens-Clarke County, GA', 'Bangor, ME',
       'Barnstable Town, MA', 'Beckley, WV', 'Billings, MT',
       'Blacksburg-Christiansburg-Radford, VA',
       'Boston-Cambridge-Newton, MA-NH',
       'Bridgeport-Stamford-Danbury, CT', 'Brunswick-St. Simons, GA',
       'Burlington-South Burlington, VT', 'California-Lexington Park, MD',
       'Cape Girardeau, MO-IL', 'Carbondale-Marion, IL', 'Columbia, MO',
       'Cumberland, MD-WV', 'Danville, VA', 'Daphne-Fairhope-Foley, AL',
       'Dubuque, IA', 'Enid, OK', 'Farmington, NM', 'Florence, SC',
       'Gainesville, GA', 'Goldsboro, NC', 'Grand Island, NE',
       'Great Falls, MT', 'Hammond, LA', 'Harrisonburg, VA',
       'Hartford-West Hartford-East Hartford, CT',
       'Hilton Head Island-Bluffton-Port Royal, SC', 'Hinesville, GA',
       'Holland-Grand Haven, MI', 'Homosassa Springs, FL',
       'Hot Springs, AR', 'Houma-Bayou Cane-Thibodaux,

In [19]:
manu_null_1 = metro_exports_all.groupby("MSA")["Manufacturing_Employment"].agg(lambda x: x.isnull().sum())
manu_null_1[manu_null_1 > 0]

MSA
Albany, GA            6
Alexandria, LA       19
Ames, IA             19
Anderson, IN          8
Anderson, SC          8
                     ..
Twin Falls, ID        7
Valdosta, GA         19
Warner Robins, GA    19
Winchester, VA-WV    19
Worcester, MA        19
Name: Manufacturing_Employment, Length: 75, dtype: int64

In [20]:
manu_null_2 = metro_exports_all.groupby("Year")["Manufacturing_Employment"].agg(lambda x: x.isnull().sum())
manu_null_2[manu_null_2 > 0]

Year
2005    55
2006    56
2007    56
2008    56
2009    59
2010    58
2011    57
2012    57
2013    64
2014    64
2015    64
2016    65
2017    65
2018    67
2019    67
2020    68
2021    67
2022    67
2023    61
Name: Manufacturing_Employment, dtype: int64
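
The per-MSA and per-year null counts above follow the same pattern, so they could be wrapped in a small helper. The function `null_counts` below is a hypothetical name, and the toy frame is made up; this is a sketch of the pattern rather than the notebook's own code.

```python
import numpy as np
import pandas as pd

def null_counts(df: pd.DataFrame, column: str, by: str) -> pd.Series:
    """Count missing values of `column` per group of `by`, keeping only groups with gaps."""
    counts = df[column].isnull().groupby(df[by]).sum()
    return counts[counts > 0]

# Made-up data for illustration only.
toy = pd.DataFrame({
    "MSA": ["A", "A", "B", "B", "C", "C"],
    "Year": [2005, 2006, 2005, 2006, 2005, 2006],
    "Manufacturing_Employment": [np.nan, np.nan, 10.0, np.nan, 5.0, 6.0],
})

by_msa = null_counts(toy, "Manufacturing_Employment", "MSA")
by_year = null_counts(toy, "Manufacturing_Employment", "Year")
print(by_msa)
print(by_year)
```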

When it comes to the weather, all metro areas in the contiguous United States are covered. We are only missing Alaska and Hawaii, which are not covered by our data. Alaska and Hawaii would also potentially be outliers on other fronts because of their geography: extreme cold in Alaska and island isolation in Hawaii. We will drop Alaska and Hawaii from our dataset.

In [21]:
akhi = metro_exports_all[metro_exports_all["Jan_avg_temp"].isnull()]["MSA"].unique()
akhi

array(['Anchorage, AK', 'Fairbanks-College, AK', 'Kahului-Wailuku, HI',
       'Urban Honolulu, HI'], dtype=object)

In [22]:
metro_exports_all = metro_exports_all.set_index("MSA")
metro_exports_all = metro_exports_all.drop(labels = akhi, axis = 0)
metro_exports_all = metro_exports_all.reset_index()
metro_exports_all = metro_exports_all.rename_axis(None, axis=1)

For our State Policy Variables, we see that the Corporate Income Tax Rate has a lot of missing values. However, we know from the import step that our initial dataset was missing the year 2009, and the check below confirms that there are no missing values outside that year. So, let's fill the 2009 null values with the 2008 values, since it's unlikely that major tax policy changes were made during an election year to take effect in 2009.

In [23]:
metro_exports_all[(metro_exports_all["Year"] != 2009) & (metro_exports_all["Top_Corporate_Income_Tax_Rate"].isnull())]["MSA"].unique()

array([], dtype=object)

In [24]:
metro_exports_all = metro_exports_all.set_index(["MSA", "Year"])
metro_exports_all["Top_Corporate_Income_Tax_Rate"] = metro_exports_all["Top_Corporate_Income_Tax_Rate"].ffill()

In [25]:
#metro_exports_all["Top_Corporate_Income_Tax_Rate"].head(n =35)

Similarly, the other policy variables are filled in for every year except 2023, because the most recent publication of the Freedom of States Index covers 2022. So, we can use the same process to fill in our 2023 data with the 2022 values.

In [26]:
#metro_exports_all[(metro_exports_all["Year"] != 2023) & (metro_exports_all["Policy_Right-to-Work"].isnull())]["MSA"].unique()

In [27]:
policy_cols = ["Policy_Right-to-Work", "Policy_Urban_Growth_Boundary",
               "Policy_Pricing_Strategy_Ban", "Policy_Anti-Price_Gouging"]
metro_exports_all[policy_cols] = metro_exports_all[policy_cols].ffill()

In [28]:
metro_exports_all = metro_exports_all.reset_index()
metro_exports_all = metro_exports_all.rename_axis(None, axis=1)
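
One caveat with a plain forward fill is that it depends on row order: if an MSA's first row is null, it inherits the last value of the MSA listed before it. The sketch below, on made-up tax rates, contrasts a plain `ffill` with a per-MSA `groupby(...).ffill()`; it assumes rows are sorted by MSA and Year and is offered as a possible safeguard, not as the method used above.

```python
import numpy as np
import pandas as pd

# Made-up rates for illustration; MSA "B" has no 2008 row to fill from.
toy = pd.DataFrame({
    "MSA": ["A", "A", "A", "B", "B"],
    "Year": [2008, 2009, 2010, 2009, 2010],
    "Top_Corporate_Income_Tax_Rate": [6.0, np.nan, 6.5, np.nan, 8.0],
})
toy = toy.sort_values(["MSA", "Year"]).reset_index(drop=True)

# Plain forward fill: B's 2009 gap borrows A's 2010 value.
plain = toy["Top_Corporate_Income_Tax_Rate"].ffill()

# Grouped forward fill: values never cross from one MSA to another.
grouped = toy.groupby("MSA")["Top_Corporate_Income_Tax_Rate"].ffill()

print(pd.DataFrame({"plain": plain, "grouped": grouped}))
```

With the grouped version, any gap that has no earlier value within the same MSA stays null and can be reviewed explicitly.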

## Feature Engineering
We will create a few derived features, including:
* `energy_consumption` - sum of all `month_cooling` and `month_heating` values for the year
* `avg_cooling` - average of all `month_cooling` values for the year
* `avg_heating` - average of all `month_heating` values for the year
* `avg_weather` - average of all 12 months of `avg_temp`
* `winter_weather` - average of `Jan_avg_temp`, `Feb_avg_temp`, `Mar_avg_temp`
* `summer_weather` - average of `Jun_avg_temp`, `Jul_avg_temp`, `Aug_avg_temp`
* `FHFA_index` - average of the four quarterly `FHFA_index` columns
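
The derived features listed above can also be built by selecting columns by name rather than by position, which keeps working if the column order changes. This is a sketch under assumptions: the helper `add_derived_features` is hypothetical, the toy row is made up, and the averages follow the list's descriptions.

```python
import pandas as pd

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
temp_cols = [f"{m}_avg_temp" for m in months]
cooling_cols = [f"{m}_cooling" for m in months]
heating_cols = [f"{m}_heating" for m in months]
fhfa_cols = [f"FHFA_index_Q{q}" for q in range(1, 5)]

def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the derived weather, energy, and housing columns."""
    out = df.copy()
    out["energy_consumption"] = out[cooling_cols + heating_cols].sum(axis=1)
    out["avg_cooling"] = out[cooling_cols].mean(axis=1)
    out["avg_heating"] = out[heating_cols].mean(axis=1)
    out["avg_weather"] = out[temp_cols].mean(axis=1)
    out["winter_weather"] = out[temp_cols[0:3]].mean(axis=1)  # Jan, Feb, Mar
    out["summer_weather"] = out[temp_cols[5:8]].mean(axis=1)  # Jun, Jul, Aug
    out["FHFA_index"] = out[fhfa_cols].mean(axis=1)
    return out

# One made-up row: temperatures 50..61, constant cooling and heating values.
row = {}
for i, m in enumerate(months):
    row[f"{m}_avg_temp"] = 50.0 + i
    row[f"{m}_cooling"] = 10.0
    row[f"{m}_heating"] = 20.0
for q in range(1, 5):
    row[f"FHFA_index_Q{q}"] = 100.0 * q

toy = add_derived_features(pd.DataFrame([row]))
print(toy[["energy_consumption", "avg_weather", "winter_weather",
           "summer_weather", "FHFA_index"]])
```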

In [29]:
metro_exports_all.iloc[:,47]

0        0.0
1       20.0
2        0.0
3       18.0
4       16.0
        ... 
6983     0.0
6984     0.0
6985     0.0
6986     0.0
6987     0.0
Name: Sep_heating, Length: 6988, dtype: float64

In [30]:
# Cooling columns sit at positions 27-38 and heating columns at 39-50.
cols = list(range(27,51))
metro_exports_all["energy_consumption"] = metro_exports_all.iloc[:, cols].sum(axis = 1)

# Use the mean (not the sum) so the values match the avg_ column names.
cols = list(range(27,39))
metro_exports_all["avg_cooling"] = metro_exports_all.iloc[:, cols].mean(axis = 1)

cols = list(range(39,51))
metro_exports_all["avg_heating"] = metro_exports_all.iloc[:, cols].mean(axis = 1)

metro_exports_all["energy_consumption"]

0       4825.0
1       4865.0
2       4895.0
3       4893.0
4       4892.0
         ...  
6983    5243.0
6984    5679.0
6985    5399.0
6986    5633.0
6987    5374.0
Name: energy_consumption, Length: 6988, dtype: float64

In [31]:
cols = list(range(15,27))
metro_exports_all["avg_weather"] = metro_exports_all.iloc[:, cols].mean(axis = 1)
metro_exports_all["avg_weather"]

0       63.350000
1       65.133333
2       62.158333
3       63.275000
4       63.175000
          ...    
6983    71.625000
6984    73.666667
6985    73.833333
6986    73.191667
6987    72.433333
Name: avg_weather, Length: 6988, dtype: float64

In [32]:
cols = list(range(15,18))
metro_exports_all["winter_weather"] = metro_exports_all.iloc[:, cols].mean(axis = 1)
metro_exports_all["winter_weather"]

0       48.866667
1       51.366667
2       47.433333
3       48.600000
4       50.966667
          ...    
6983    56.233333
6984    57.466667
6985    57.933333
6986    59.033333
6987    55.533333
Name: winter_weather, Length: 6988, dtype: float64

In [33]:
cols = list(range(20,23))
metro_exports_all["summer_weather"] = metro_exports_all.iloc[:, cols].mean(axis = 1)
metro_exports_all["summer_weather"]

0       80.400000
1       83.433333
2       78.666667
3       81.833333
4       82.166667
          ...    
6983    91.000000
6984    92.366667
6985    92.366667
6986    92.000000
6987    91.000000
Name: summer_weather, Length: 6988, dtype: float64

In [34]:
cols = list(range(6,10))
metro_exports_all["FHFA_index"] = metro_exports_all.iloc[:, cols].mean(axis = 1)
metro_exports_all["FHFA_index"]

0       141.6100
1       153.6775
2       162.9025
3       166.1000
4       169.7750
          ...   
6983    193.9450
6984    207.7200
6985    249.5950
6986    303.0200
6987    316.3150
Name: FHFA_index, Length: 6988, dtype: float64

Thanks to the feature engineering above, we can drop the individual NOAA and FHFA columns.

In [35]:
metro_exports_all = metro_exports_all.drop(['FHFA_index_Q1', 'FHFA_index_Q2', 'FHFA_index_Q3', 'FHFA_index_Q4', 
                                            'Jan_avg_temp', 'Feb_avg_temp', 'Mar_avg_temp', 'Apr_avg_temp', 
                                            'May_avg_temp', 'Jun_avg_temp', 'Jul_avg_temp', 'Aug_avg_temp', 
                                            'Sep_avg_temp', 'Oct_avg_temp', 'Nov_avg_temp', 'Dec_avg_temp', 
                                            'Jan_cooling', 'Feb_cooling', 'Mar_cooling', 'Apr_cooling', 
                                            'May_cooling', 'Jun_cooling', 'Jul_cooling', 'Aug_cooling', 
                                            'Sep_cooling', 'Oct_cooling', 'Nov_cooling', 'Dec_cooling', 
                                            'Jan_heating', 'Feb_heating', 'Mar_heating', 'Apr_heating', 
                                            'May_heating', 'Jun_heating', 'Jul_heating', 'Aug_heating', 
                                            'Sep_heating', 'Oct_heating', 'Nov_heating', 'Dec_heating'], axis = 1)

In [36]:
metro_exports_all.columns

Index(['MSA', 'Year', 'exports', 'Per_Capita_Income', 'Personal_Income',
       'Population', 'Employment', 'Labor_Force', 'Unemployment_Rate',
       'Unemployment_Raw', 'Manufacturing_Employment',
       'Top_Corporate_Income_Tax_Rate', 'Policy_Right-to-Work',
       'Policy_Urban_Growth_Boundary', 'Policy_Pricing_Strategy_Ban',
       'Policy_Anti-Price_Gouging', 'Minimum_Wage', 'USD_to_Euro',
       'USD_to_Pound', 'USD_to_Peso', 'USD_to_Yuan', 'USD_to_Yen',
       'S&P500_Average', 'S&P500_Close', 'DJIA_close', 'energy_consumption',
       'avg_cooling', 'avg_heating', 'avg_weather', 'winter_weather',
       'summer_weather', 'FHFA_index'],
      dtype='object')

## Final Null Elimination
Now, as we review the values that remain, we will create two dataframes. `metro_exports_all` keeps more rows: it excludes the BLS columns (other than the Unemployment Rate) and drops any rows that still have missing values, such as those lacking BEA data. `metro_exports_all_features` keeps all columns, but only the rows with no missing values.

In [37]:
metro_exports_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6988 entries, 0 to 6987
Data columns (total 32 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   MSA                            6988 non-null   object 
 1   Year                           6988 non-null   int64  
 2   exports                        6988 non-null   float64
 3   Per_Capita_Income              6935 non-null   float64
 4   Personal_Income                6935 non-null   float64
 5   Population                     6935 non-null   float64
 6   Employment                     6761 non-null   float64
 7   Labor_Force                    6761 non-null   float64
 8   Unemployment_Rate              6988 non-null   float64
 9   Unemployment_Raw               6761 non-null   float64
 10  Manufacturing_Employment       5815 non-null   float64
 11  Top_Corporate_Income_Tax_Rate  6988 non-null   float64
 12  Policy_Right-to-Work           6988 non-null   f

In [38]:
metro_exports_all_features = metro_exports_all.copy()
metro_exports_all_features = metro_exports_all_features.dropna(axis = 0)

metro_exports_all = metro_exports_all.drop(["Employment", "Labor_Force", "Unemployment_Raw", "Manufacturing_Employment"], axis = 1)
metro_exports_all = metro_exports_all.dropna(axis = 0)
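
The trade-off between the two dataframes can be seen on a toy example. In the sketch below (made-up values, hypothetical variable names), dropping the sparse columns first preserves rows that a blanket `dropna` would otherwise discard.

```python
import numpy as np
import pandas as pd

# Made-up data: each of rows B, C, and D is missing exactly one value.
toy = pd.DataFrame({
    "MSA": ["A", "B", "C", "D"],
    "Population": [1000.0, np.nan, 3000.0, 4000.0],
    "Employment": [500.0, 600.0, np.nan, 2000.0],
    "Manufacturing_Employment": [50.0, 60.0, 70.0, np.nan],
})
sparse_cols = ["Employment", "Manufacturing_Employment"]

# All columns, complete rows only.
all_features = toy.dropna(axis=0)

# Fewer columns, so fewer rows are lost to missing values.
more_rows = toy.drop(columns=sparse_cols).dropna(axis=0)

print(len(all_features), len(more_rows))
```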

In [39]:
metro_exports_all.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6887 entries, 0 to 6987
Data columns (total 28 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   MSA                            6887 non-null   object 
 1   Year                           6887 non-null   int64  
 2   exports                        6887 non-null   float64
 3   Per_Capita_Income              6887 non-null   float64
 4   Personal_Income                6887 non-null   float64
 5   Population                     6887 non-null   float64
 6   Unemployment_Rate              6887 non-null   float64
 7   Top_Corporate_Income_Tax_Rate  6887 non-null   float64
 8   Policy_Right-to-Work           6887 non-null   float64
 9   Policy_Urban_Growth_Boundary   6887 non-null   float64
 10  Policy_Pricing_Strategy_Ban    6887 non-null   float64
 11  Policy_Anti-Price_Gouging      6887 non-null   float64
 12  Minimum_Wage                   6887 non-null   float6

In [40]:
metro_exports_all_features.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5746 entries, 0 to 6987
Data columns (total 32 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   MSA                            5746 non-null   object 
 1   Year                           5746 non-null   int64  
 2   exports                        5746 non-null   float64
 3   Per_Capita_Income              5746 non-null   float64
 4   Personal_Income                5746 non-null   float64
 5   Population                     5746 non-null   float64
 6   Employment                     5746 non-null   float64
 7   Labor_Force                    5746 non-null   float64
 8   Unemployment_Rate              5746 non-null   float64
 9   Unemployment_Raw               5746 non-null   float64
 10  Manufacturing_Employment       5746 non-null   float64
 11  Top_Corporate_Income_Tax_Rate  5746 non-null   float64
 12  Policy_Right-to-Work           5746 non-null   float6

## Export
We will export both dataframes for future use.

In [41]:
metro_exports_all.to_csv('cleaned_metro_exports.csv', index=False)
metro_exports_all_features.to_csv('cleaned_metro_exports_all_features.csv', index=False)
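
As a quick sanity check on the export settings, the sketch below writes a made-up frame to an in-memory buffer and reads it back. It shows why `index=False` matters: without it, the row index is written as an extra column that pandas names `Unnamed: 0` on reload.

```python
import io
import pandas as pd

# Made-up frame for illustration only.
toy = pd.DataFrame({"MSA": ["A", "B"], "Year": [2005, 2006], "exports": [1.5e8, 2.5e8]})

# Export without the index, as in the cell above, then reload.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Export with the default index=True for comparison.
buffer_idx = io.StringIO()
toy.to_csv(buffer_idx)
buffer_idx.seek(0)
with_index = pd.read_csv(buffer_idx)

print(list(reloaded.columns))
print(list(with_index.columns))
```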