# Data Collection
Goal: Organize your data to streamline the next steps of your capstone.

In [1]:
# Import pandas, matplotlib.pyplot, seaborn, and os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [2]:
# Change the working directory to the folder that holds the raw FRED data
path = '/Users/josephfrasca/Coding_Stuff/Springboard/Capstone_2/data/raw/fredgraph'
os.chdir(path)

In [3]:
os.getcwd()

'/Users/josephfrasca/Coding_Stuff/Springboard/Capstone_2/data/raw/fredgraph'

In [4]:
os.listdir()

['Annual.csv', 'Monthly.csv', 'Quarterly.csv']

### Data Loading

In [5]:
# load FRED data - current US national data
df_yr = pd.read_csv('Annual.csv')
df_m = pd.read_csv('Monthly.csv')
df_q = pd.read_csv('Quarterly.csv')

# Data Definition
- Goal: Gain an understanding of your data features to inform the next steps of your project.
- Time estimate: 1-2 hours
  - Column names
  - Data types
  - Description of the columns
  - Counts and percentages of unique values
  - Ranges of values
- Hint: here are some useful questions to ask yourself during this process:
  - Do your column names correspond to what those columns store?
  - Check the data types of your columns. Are they sensible?
  - Calculate summary statistics for each of your columns, such as mean, median, mode, standard deviation, range, and number of unique values. What does this tell you about your data? What do you now need to investigate?
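
The checklist above can be sketched as a small helper. This is a minimal illustration on a toy frame, not the notebook's own method: the `summarize` function and the sample values are assumptions made for the example.

```python
import pandas as pd

# Toy frame standing in for one of the FRED tables (values are illustrative).
df = pd.DataFrame({
    "DATE": ["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"],
    "UNRATE": [3.5, 3.5, 4.4, 14.7],
})


def summarize(frame):
    """Return per-column dtype, unique count, and percent of unique values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_unique": frame.nunique(),
        "pct_unique": frame.nunique() / len(frame) * 100,
    })


summary = summarize(df)
print(summary)

# describe() adds count, mean, std, min, quartiles, and max for numeric columns.
print(df["UNRATE"].describe())
```

Columns that are still stored as `object` will not show numeric statistics in `describe()`, which is itself a useful signal that a conversion step is needed.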

In [6]:
#see an info summary of the data
df_yr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   DATE           59 non-null     object 
 1   SPPOPGROWUSA   59 non-null     float64
 2   MEHOINUSA672N  59 non-null     object 
dtypes: float64(1), object(2)
memory usage: 1.5+ KB


In [7]:
df_m.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1268 entries, 0 to 1267
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   DATE           1268 non-null   object
 1   UNRATE         1268 non-null   object
 2   INTDSRUSM193N  1268 non-null   object
 3   CUUR0000SEHA   1268 non-null   object
 4   CSUSHPINSA     1268 non-null   object
 5   HOUST          1268 non-null   object
 6   WPUIP2311001   1268 non-null   object
 7   TLRESCONS      1268 non-null   object
dtypes: object(8)
memory usage: 79.4+ KB


In [8]:
df_q.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 258 entries, 0 to 257
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   DATE         258 non-null    object 
 1   RRVRUSQ156N  258 non-null    float64
dtypes: float64(1), object(1)
memory usage: 4.2+ KB


In [9]:
# Display the annual FRED data to check each series' date range
#SPPOPGROWUSA has data from 1961-01-01 to 2019-01-01
#MEHOINUSA672N has data from 1984-01-01 to 2018-01-01
df_yr

Unnamed: 0,DATE,SPPOPGROWUSA,MEHOINUSA672N
0,1961-01-01,1.65773,.
1,1962-01-01,1.537997,.
2,1963-01-01,1.439165,.
3,1964-01-01,1.389046,.
4,1965-01-01,1.250172,.
5,1966-01-01,1.154893,.
6,1967-01-01,1.088881,.
7,1968-01-01,0.998461,.
8,1969-01-01,0.977243,.
9,1970-01-01,1.165003,.


In [10]:
'''
UNRATE data from 1948-01-01 to 2020-07-01
INTDSRUSM193N from 1950-01-01 to 2020-07-01
CUUR0000SEHA from 1914-12-01 to 2020-07-01
CSUSHPINSA from 1987-01-01 to 2020-06-01
HOUST from 1959-01-01 to 2020-07-01
WPUIP2311001 from 1986-06-01 to 2020-07-01
TLRESCONS from 2002-01-01 to 2020-06-01
'''
df_m.tail()

Unnamed: 0,DATE,UNRATE,INTDSRUSM193N,CUUR0000SEHA,CSUSHPINSA,HOUST,WPUIP2311001,TLRESCONS
1263,2020-03-01,4.4,0.25,339.519,215.16,1269.0,224.5,595963
1264,2020-04-01,14.7,0.25,340.135,217.32299999999998,934.0,215.9,569892
1265,2020-05-01,13.3,0.25,340.811,218.6,1038.0,217.3,549977
1266,2020-06-01,11.1,0.25,341.294,219.81900000000002,1220.0,221.4,542307
1267,2020-07-01,10.2,0.25,341.95,.,1496.0,225.3,.


In [11]:
#RRVRUSQ156N is from 1956-01-01 to 2020-04-01
df_q.tail()

Unnamed: 0,DATE,RRVRUSQ156N
253,2019-04-01,6.8
254,2019-07-01,6.8
255,2019-10-01,6.4
256,2020-01-01,6.6
257,2020-04-01,5.7


# Data Cleaning

- Goal: Clean up the data in order to prepare it for the next steps of your project.
- Time estimate: 1-2 hours
  - NA or missing values
  - Duplicates
- Hint: don’t forget about the following awesome Python functions for data cleaning, which make life a whole lot easier:
  - loc[] - filter your data by label
  - iloc[] - filter your data by integer position
  - apply() - execute a function across an axis of a DataFrame
  - drop() - drop rows or columns from a DataFrame
  - is_unique - a Series attribute (not a method) that tells you whether a column is a unique identifier
  - Series methods, such as str.contains(), which can be used to check if a certain substring occurs in a string of a Series, and str.extract(), which can be used to extract capture groups with a certain regex (or regular expression) pattern
  - NumPy functions like np.where(), to clean columns. Recall that it has the structure: np.where(condition, value_if_true, value_if_false)
  - DataFrame methods to check for null values, such as df.isnull().values.any()
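
One caveat for this dataset: the outputs above show `.` where a series has no observation, and `isnull()` does not flag that placeholder because it is an ordinary string. A minimal sketch of the missing-value and duplicate checks, using a toy frame with illustrative values (the duplicated date is invented for the example):

```python
import pandas as pd

raw = pd.DataFrame({
    "DATE": ["2020-05-01", "2020-06-01", "2020-07-01", "2020-07-01"],
    "CSUSHPINSA": ["218.6", "219.819", ".", "."],
})

# isnull() finds nothing here: "." is a string, not NaN.
has_nan = raw.isnull().values.any()

# Count the "." placeholders per column instead.
placeholder_counts = (raw == ".").sum()

# Duplicate check: rows that repeat an earlier DATE.
n_dup_dates = raw.duplicated(subset="DATE").sum()

# DATE works as a unique key only if is_unique is True.
date_is_unique = raw["DATE"].is_unique

print(has_nan, placeholder_counts.to_dict(), n_dup_dates, date_is_unique)
```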

In [12]:
#convert DATE columns to datetime object
df_q['DATE']= pd.to_datetime(df_q['DATE'])
df_q.dtypes

DATE           datetime64[ns]
RRVRUSQ156N           float64
dtype: object

In [13]:
df_m['DATE']= pd.to_datetime(df_m['DATE'])
df_m.dtypes

DATE             datetime64[ns]
UNRATE                   object
INTDSRUSM193N            object
CUUR0000SEHA             object
CSUSHPINSA               object
HOUST                    object
WPUIP2311001             object
TLRESCONS                object
dtype: object

In [14]:
df_yr['DATE']= pd.to_datetime(df_yr['DATE'])
df_yr.dtypes

DATE             datetime64[ns]
SPPOPGROWUSA            float64
MEHOINUSA672N            object
dtype: object

### Data Joining

In [15]:
#merge yearly df with quarterly: df2
df2 = pd.merge_ordered(df_yr, df_q, fill_method="ffill")
df2

Unnamed: 0,DATE,SPPOPGROWUSA,MEHOINUSA672N,RRVRUSQ156N
0,1956-01-01,,,6.2
1,1956-04-01,,,5.9
2,1956-07-01,,,6.3
3,1956-10-01,,,5.8
4,1957-01-01,,,5.3
...,...,...,...,...
253,2019-04-01,0.473954,.,6.8
254,2019-07-01,0.473954,.,6.8
255,2019-10-01,0.473954,.,6.4
256,2020-01-01,0.473954,.,6.6


In [32]:
#merge df2 with monthly df: df3
df3 = pd.merge_ordered(df2, df_m, fill_method="ffill")
df3.iloc[853:]

Unnamed: 0,DATE,SPPOPGROWUSA,MEHOINUSA672N,RRVRUSQ156N,UNRATE,INTDSRUSM193N,CUUR0000SEHA,CSUSHPINSA,HOUST,WPUIP2311001,TLRESCONS
853,1986-01-01,0.924164,54608,6.9,6.7,7.5,115.500,.,1972,.,.
854,1986-02-01,0.924164,54608,6.9,7.2,7.5,115.600,.,1848,.,.
855,1986-03-01,0.924164,54608,6.9,7.2,7.1,116.200,.,1876,.,.
856,1986-04-01,0.924164,54608,7.3,7.1,6.83,117.400,.,1933,.,.
857,1986-05-01,0.924164,54608,7.3,7.2,6.5,117.600,.,1854,.,.
...,...,...,...,...,...,...,...,...,...,...,...
1263,2020-03-01,0.473954,.,6.6,4.4,0.25,339.519,215.16,1269.0,224.5,595963
1264,2020-04-01,0.473954,.,5.7,14.7,0.25,340.135,217.32299999999998,934.0,215.9,569892
1265,2020-05-01,0.473954,.,5.7,13.3,0.25,340.811,218.6,1038.0,217.3,549977
1266,2020-06-01,0.473954,.,5.7,11.1,0.25,341.294,219.81900000000002,1220.0,221.4,542307


In [34]:
# Drop rows before 1987: 9 of the 10 features have data from 1987 onward (TLRESCONS starts in 2002).
df3_87 = df3[df3['DATE'] >= '1987']
df3_87

Unnamed: 0,DATE,SPPOPGROWUSA,MEHOINUSA672N,RRVRUSQ156N,UNRATE,INTDSRUSM193N,CUUR0000SEHA,CSUSHPINSA,HOUST,WPUIP2311001,TLRESCONS
865,1987-01-01,0.893829,55260,7.4,6.6,5.5,121.300,63.755,1774,100.0,.
866,1987-02-01,0.893829,55260,7.4,6.6,5.5,121.700,64.156,1784,100.4,.
867,1987-03-01,0.893829,55260,7.4,6.6,5.5,121.800,64.491,1726,100.7,.
868,1987-04-01,0.893829,55260,7.5,6.3,5.5,122.000,64.994,1614,101.1,.
869,1987-05-01,0.893829,55260,7.5,6.3,5.5,122.300,65.568,1628,101.3,.
...,...,...,...,...,...,...,...,...,...,...,...
1263,2020-03-01,0.473954,.,6.6,4.4,0.25,339.519,215.16,1269.0,224.5,595963
1264,2020-04-01,0.473954,.,5.7,14.7,0.25,340.135,217.32299999999998,934.0,215.9,569892
1265,2020-05-01,0.473954,.,5.7,13.3,0.25,340.811,218.6,1038.0,217.3,549977
1266,2020-06-01,0.473954,.,5.7,11.1,0.25,341.294,219.81900000000002,1220.0,221.4,542307


In [18]:
#convert dtype objects to floats as needed
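
One possible way to fill in this step, sketched on a toy frame rather than on `df3_87` itself: `pd.to_numeric` with `errors="coerce"` turns the `.` placeholders into `NaN` and the remaining strings into floats. The sample values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "DATE": pd.to_datetime(["2020-05-01", "2020-06-01", "2020-07-01"]),
    "CSUSHPINSA": ["218.6", "219.819", "."],
    "TLRESCONS": ["549977", "542307", "."],
})

# Convert every column except DATE; unparseable entries such as "." become NaN.
for col in df.columns.drop("DATE"):
    df[col] = pd.to_numeric(df[col], errors="coerce")

print(df.dtypes)
```

After the conversion, `isnull()` reports the former placeholders as missing, so the usual null checks apply.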

In [19]:
#scale data to compare (maybe during preprocessing?)
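
A minimal sketch of one scaling option, min-max scaling with plain pandas, so that series measured in different units can be compared on a common 0 to 1 range. The choice of min-max over other methods is an assumption, and the columns must already be numeric.

```python
import pandas as pd

df = pd.DataFrame({
    "UNRATE": [4.4, 14.7, 13.3, 11.1],
    "HOUST": [1269.0, 934.0, 1038.0, 1220.0],
})

# Map each column onto [0, 1]: the column minimum becomes 0, the maximum 1.
scaled = (df - df.min()) / (df.max() - df.min())

print(scaled)
```

If the scaled features later feed a model, computing the minimum and maximum on the training split only avoids leaking information from the test split.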

In [20]:
#change column headers to make more user friendly
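
A possible sketch of the renaming step. The readable labels below are hypothetical, and the meaning attached to each FRED series ID is an assumption that should be checked against the series pages on FRED before use.

```python
import pandas as pd

# Hypothetical labels; verify each series description on FRED.
friendly_names = {
    "UNRATE": "unemployment_rate",
    "HOUST": "housing_starts",
    "CSUSHPINSA": "home_price_index",
    "RRVRUSQ156N": "rental_vacancy_rate",
    "MEHOINUSA672N": "median_household_income",
    "SPPOPGROWUSA": "population_growth",
}

df = pd.DataFrame({
    "DATE": ["2020-06-01"],
    "UNRATE": [11.1],
    "HOUST": [1220.0],
})

# rename() ignores mapping keys that are not present in the frame.
renamed = df.rename(columns=friendly_names)

print(list(renamed.columns))
```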

NOTES
- could use total construction spending (data from the 1990s) instead of TLRESCONS (data from 2002)
- could use linear interpolation instead of ffill for missing time-series data (i.e., when going from annual-only data to monthly)
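
The second note can be illustrated with a small comparison of forward fill and linear interpolation when an annual series is placed on a monthly index. This is a sketch with made-up values, not the notebook's data.

```python
import pandas as pd

# Two annual observations (illustrative values) indexed by date.
annual = pd.Series(
    [1.0, 2.2],
    index=pd.to_datetime(["2018-01-01", "2019-01-01"]),
    name="SPPOPGROWUSA",
)

# Reindex onto month-start dates; the new months are NaN until filled.
monthly_index = pd.date_range("2018-01-01", "2019-01-01", freq="MS")
monthly = annual.reindex(monthly_index)

ffilled = monthly.ffill()                      # step function: holds 1.0 all year
linear = monthly.interpolate(method="linear")  # straight line between the two years

print(pd.DataFrame({"ffill": ffilled, "linear": linear}))
```

`method="linear"` treats the rows as equally spaced. `method="time"` is also available and weights by the actual number of days between dates.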

SUMMARY
- first I ...

# Data Organization
- Goal: Create a file structure and add your work to the GitHub repository you’ve created for this project.
- Time estimate: 1-2 hours
  - File structure
  - GitHub
  - Hint: the glob library could come in handy here…
  - Remind yourself of why GitHub is useful. What are the main motivations for making a GitHub repository?
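
As a sketch of the glob hint, the CSV files in a raw-data folder can be loaded in one loop without changing the working directory. The temporary directory and the tiny files written below are stand-ins so the example runs anywhere; in the project, `raw_dir` would point at the folder holding the FRED downloads.

```python
import glob
import os
import tempfile

import pandas as pd

# Throwaway folder with two tiny CSVs (stand-ins for the FRED files).
raw_dir = tempfile.mkdtemp()
pd.DataFrame({"DATE": ["2020-01-01"], "A": [1.0]}).to_csv(
    os.path.join(raw_dir, "Annual.csv"), index=False
)
pd.DataFrame({"DATE": ["2020-01-01"], "B": [2.0]}).to_csv(
    os.path.join(raw_dir, "Monthly.csv"), index=False
)

# Load every CSV in the folder, keyed by file name without the extension.
frames = {}
for csv_path in sorted(glob.glob(os.path.join(raw_dir, "*.csv"))):
    name = os.path.splitext(os.path.basename(csv_path))[0]
    frames[name] = pd.read_csv(csv_path, parse_dates=["DATE"])

print(sorted(frames))
```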