# Import Source data from Kaggle

Import and clean the IPO data.
The data set has 3,762 rows, each row representing one IPO company.
It has over 1,600 columns.
The file contains the open, close, low, high, and volume for the 261 trading days following the IPO (1,305 data points per company). This data is laid out inefficiently across the columns, with a separate column for each value on each day.

The first step of data cleaning will be to use the source data to create two tables:
    1. A pricing table listing the stock ticker and trading day as attribute columns, with the stock's open, close, low, high, and volume as the values.
    2. An attribute table containing metadata on the stock ticker, i.e. company name, date founded, IPO date, CEO, headquarters location, etc. (all of these data points are in the source file as well).


In [1]:
import os
os.getcwd()
os.chdir('C:\\Users\\nmur1\\Google Drive\\Springboard\\Capstone 1\\SourceData')

In [2]:
#import packages
import pandas as pd
import numpy as np
import os

sourcepath = r'C:\Users\nmur1\Google Drive\Springboard\Capstone 1\SourceData'

#import raw data as RawDF
RawDF = pd.read_csv('IPO Data.csv',low_memory = False, encoding='ISO-8859-1')

#break off stockticker and pricing data
Pricing = RawDF.iloc[:,0:1319]
Pricing = Pricing.drop(Pricing.columns[1:9], axis = 1)

#Reindex to symbol 
Pricing = Pricing.set_index('Symbol')

#add IPO date to dataframe
IPO_Date = RawDF.loc[:,['Symbol','ipoDate']].set_index('Symbol')
IPO_Date['ipoDate'] = pd.to_datetime(IPO_Date['ipoDate'])
Pricing = pd.concat([Pricing,IPO_Date], axis = 1)

#after inspection I found that MITT repeated 64 times. Drop the duplicates here
Pricing = Pricing.drop_duplicates()
RawDF.shape

(3762, 1664)

### Pull out daily pricing from the columns to make a Pricing Table
The next subroutine loops through the Pricing dataframe established above and creates a new dataframe with Symbol, Trade Day, Open, Close, Low, High, and Volume as the columns. Each row holds the values for one stock on a given trading day.

In [3]:
import datetime as dt

#need to clean data and create a new table with ticker, trading day, open price, close price, high, low, and volume
#in the columns, with the values in the rows

#create empty lists for my columns
tday = []
Closing = []
High = []
Opening = []
Low = []
Volume = []

cols = 0
day = 0

#loop through the dataframe five columns at a time (one trading day per step) to store the pertinent data
while cols <= 1305:
    
    df = pd.DataFrame(Pricing.index)
    df['trade day'] = day
    tday.append(df)
    Closing.append(Pricing.iloc[:,cols]) #closing price starts at 0
    High.append(Pricing.iloc[:,cols+1]) #high one column over from closing
    Opening.append(Pricing.iloc[:,cols+2]) #opening two columns over from closing
    Low.append(Pricing.iloc[:,cols+3]) # low 3 columns over from closing
    Volume.append(Pricing.iloc[:,cols+4]) #Volume 4 columns over from closing
    
    day = day + 1
    cols = cols + 5 #increment column by 5 (new day is every 5 columns)

#run concatenations on the indexes, to turn into dataframes     
df_Closing = pd.concat(Closing, axis = 0).reset_index()
df_Closing = df_Closing.drop(df_Closing.columns[0], axis = 1)

df_High = pd.concat(High, axis = 0).reset_index()
df_High = df_High.drop(df_High.columns[0], axis = 1)

df_Opening = pd.concat(Opening, axis = 0).reset_index()
df_Opening = df_Opening.drop(df_Opening.columns[0], axis = 1)

df_Low = pd.concat(Low, axis = 0).reset_index()
df_Low = df_Low.drop(df_Low.columns[0], axis = 1)

df_Volume = pd.concat(Volume, axis = 0).reset_index()
df_Volume = df_Volume.drop(df_Volume.columns[0], axis = 1)

df_Day = pd.concat(tday, axis = 0).reset_index()
df_Day = df_Day.drop(df_Day.columns[0], axis = 1)

#concatenate all the above dataframes side by side (axis = 1)
df_pricing = pd.concat([df_Day, df_Closing, df_High, df_Opening, df_Low, df_Volume], axis = 1)
df_pricing.columns = ['Symbol', 'Trade Day', 'C', 'H', 'O', 'L', 'V']

#inspect new dataframe
#should have up to 262 records for each ticker (trade days 0 through 261)
#drop na's or records that don't have any pricing data
df_pricing.dropna(inplace = True)
df_pricing

Unnamed: 0,Symbol,Trade Day,C,H,O,L,V
0,A,0,28.6358,33.5207,27.3725,30.6572,59753154.0
1,AAC,0,18.5000,20.1000,17.6000,20.0000,2799073.0
2,AAOI,0,9.9600,10.0900,9.3700,10.0000,948999.0
3,AAP,0,13.9000,14.4667,13.3833,13.4000,371100.0
4,AAT,0,21.2500,22.0000,21.1800,21.5300,15536889.0
...,...,...,...,...,...,...,...
968603,ZNH,261,3.9587,4.0000,3.8753,3.8753,98700.0
968606,ZSAN,261,45.6000,45.6000,42.8000,42.8000,21.0
968607,ZTO,261,15.7300,15.8950,15.3500,15.6000,1922801.0
968611,ZX,261,3.4600,3.5500,3.3600,3.5300,22850.0
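
As a cross-check on the loop above, the same wide-to-long reshape can be sketched with a NumPy reshape. This is only an illustration on a made-up table (two hypothetical tickers, three trading days); it assumes, as in the source file, that each trading day occupies five adjacent columns in the order C, H, O, L, V. Note that the rows come out grouped by ticker rather than by trading day.

```python
import numpy as np
import pandas as pd

# Toy wide table (made-up values): 2 tickers, 3 trading days, 5 fields per day.
fields = ['C', 'H', 'O', 'L', 'V']
n_days = 3
wide = pd.DataFrame(
    np.arange(2 * n_days * len(fields), dtype=float).reshape(2, -1),
    index=pd.Index(['AAA', 'BBB'], name='Symbol'),
)

# View each row as (day, field) blocks, then flatten to one row per ticker and day.
values = wide.to_numpy().reshape(len(wide), n_days, len(fields))
long_index = pd.MultiIndex.from_product(
    [wide.index, range(n_days)], names=['Symbol', 'Trade Day']
)
long = pd.DataFrame(
    values.reshape(-1, len(fields)), index=long_index, columns=fields
).reset_index()

print(long)
```

On the real data, `n_days` would be the number of five-column blocks in `Pricing` after the ipoDate column is set aside.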


In [4]:
#Save CleanData set to new folder
os.chdir('C:\\Users\\nmur1\\Google Drive\\Springboard\\Capstone 1\\CleanData')
df_pricing.to_csv('DailyPricing.csv')

# Make Attribute Table with Key MetaData Points

My next step will be to create an attribute or metadata table containing all of the key stats on each company. There will be one record/row per stock with multiple data points, including date founded, revenue, income, ipoDate, day of week IPO'd, state, etc.

In [5]:
#make my attribute table with the other metadata
#metadata is in columns 0 through 8 and all columns from 1319 to the end of the dataset. 
#the columns in the middle were pricing data points that I split off into the pricing table listed above

#split out desired columns using iloc. Concatenate back to one and inspect:
Attribute1 = RawDF.iloc[:,0:9]
Attribute2 = RawDF.iloc[:,1319:]
FullAttribute = pd.concat([Attribute1,Attribute2], axis = 1)
FullAttribute.shape

(3762, 354)

# Reduce Size of Metadata table by dropping irrelevant columns

After this step my attribute table still had 354 columns, which was quite cumbersome. I reviewed the columns and found that the majority were mostly N/A. The code below computes the share of N/A values in each column. To start, I'm only going to keep attributes with at most 50% N/A values. This leaves 50 columns with most of the key metadata I'm looking for, i.e. IPO date, CEO age/gender, company location, company age at IPO, year founded, etc.

In [6]:
#create df with N/A percentages
nas=pd.DataFrame(FullAttribute.isnull().sum().sort_values(ascending=True)/len(FullAttribute),columns = ['percent'])

#filter percent less than 50
nasFilt = nas['percent']<=.5

#create and apply boolean series filter
tokeep = nas[nasFilt]
df_Att =FullAttribute.loc[:,tokeep.index]

#inspect data
df_Att.to_clipboard()
df_Att.shape



(3762, 50)
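
The N/A filter above can also be written more compactly, since the mean of a boolean mask is the share of missing values. A minimal sketch on a made-up four-row table (the column names `Half` and `Sparse` are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy attribute table with made-up values.
toy = pd.DataFrame({
    'Symbol': ['A', 'B', 'C', 'D'],
    'Revenue': ['1M', np.nan, '3M', '4M'],    # 25% missing
    'Half': [1.0, np.nan, np.nan, 4.0],       # 50% missing
    'Sparse': [np.nan, np.nan, np.nan, 1.0],  # 75% missing
})

# Share of N/A values per column.
na_share = toy.isnull().mean()

# Keep columns with at most 50% N/A; column order is preserved.
kept = toy.loc[:, na_share <= 0.5]

print(na_share)
print(kept.columns.tolist())
```

Unlike the cell above, this version keeps the columns in their original order instead of sorting them by N/A share.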

# Inspect the Data I Dropped to Determine if I want to Source Elsewhere

In [7]:
NullColumns =pd.DataFrame(FullAttribute.isnull().sum().sort_values(ascending=False)/len(FullAttribute),columns = ['percent'])
NullFilt = NullColumns['percent']>.50

NullColumns[NullFilt]

Unnamed: 0,percent
Other_intangiblesYearBeforeIPO,0.999734
Loans_issuedYearBeforeIPO,0.999734
Preferred_dividendsYearBeforeIPO,0.999734
Restricted_cash_and_cash_equivalentsYearBeforeIPO,0.999734
Provision_for_loan_lossesYearBeforeIPO,0.999734
...,...
Net_cash_provided_by_operating_activitiesYearBeforeIPO,0.891015
Net_incomeYearBeforeIPO,0.876396
Fiscal_year_ends_in_December_USDYearBeforeIPO,0.865231
exactDiffernce,0.583200


# Inspect the number of unique values in each column

In [8]:
df_Att.nunique()

Symbol                        3699
Safe                             2
Profitable                       2
yearDifferenceGrouped            1
FoundingDateGrouped              6
usablePresidentGender            8
usableCEOGender                  8
FiscalMonth                     13
USACompany                       3
MarketYearTrend               2144
Market6MonthTrend             2144
Market3MonthTrend             2144
MarketMonthTrend              2144
ipoDate                       2239
Summary Quote                 3700
HomeRun                          2
Month                           12
MarketCap                     3552
Name                          3556
dayOfWeek                        5
Day                             31
Year                            23
daysProfitGrouped                5
daysProfit                     263
DaysBetterThanSP               153
LastSale                      2756
usablePresidentAge               8
usableCEOAge                     8
CEOGender           

### Gender contains 8 values - need to fix that

In [9]:
df_Att['usableCEOGender'].value_counts()

Blank            1971
Unknown           855
male              763
unknown            81
female             44
mostly_male        21
andy               16
mostly_female      11
Name: usableCEOGender, dtype: int64

In [10]:
df_Att['usableCEOGender'] = df_Att['usableCEOGender'].str.replace('Unknown', 'unknown')
df_Att['usableCEOGender'] = df_Att['usableCEOGender'].str.replace('Blank', 'unknown')
df_Att['usableCEOGender'] = df_Att['usableCEOGender'].str.replace('mostly_male', 'male')
df_Att['usableCEOGender'] = df_Att['usableCEOGender'].str.replace('mostly_female', 'female')
df_Att['usableCEOGender'] = df_Att['usableCEOGender'].str.replace('andy', 'male')

In [11]:
df_Att['usableCEOGender'].value_counts()

unknown    2907
male        800
female       55
Name: usableCEOGender, dtype: int64

In [12]:
#CEOGender field looks to contain more relevant data than usableCEOGender

df_Att.drop(columns = 'usableCEOGender',inplace = True)

Fix the misidentified genders in the other gender column, CEOGender

In [13]:
df_Att['CEOGender'].value_counts()

male             2683
unknown           283
female            155
mostly_male        57
andy               51
mostly_female      34
Name: CEOGender, dtype: int64

In [14]:
df_Att['CEOGender'] = df_Att['CEOGender'].str.replace('Unknown', 'unknown')
df_Att['CEOGender'] = df_Att['CEOGender'].str.replace('Blank', 'unknown')
df_Att['CEOGender'] = df_Att['CEOGender'].str.replace('mostly_male', 'male')
df_Att['CEOGender'] = df_Att['CEOGender'].str.replace('mostly_female', 'female')
df_Att['CEOGender'] = df_Att['CEOGender'].str.replace('andy', 'male')
df_Att['CEOGender'].value_counts()

male       2791
unknown     283
female      189
Name: CEOGender, dtype: int64
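
The chain of `str.replace` calls above can also be sketched as a single dictionary lookup. This is an illustration on a made-up Series, not the notebook's data; the mapping simply mirrors the replacements made above. One difference worth knowing: `Series.replace` with a dict matches whole values, while `str.replace` matches substrings.

```python
import pandas as pd

# Mapping that mirrors the replacements in the cells above.
gender_map = {
    'Unknown': 'unknown',
    'Blank': 'unknown',
    'mostly_male': 'male',
    'mostly_female': 'female',
    'andy': 'male',
}

# Made-up sample values for illustration.
raw = pd.Series(['male', 'Blank', 'mostly_female', 'andy', 'Unknown', 'female'])

# Whole-value replacement; values not in the mapping are left untouched.
clean = raw.replace(gender_map)

print(clean.value_counts())
```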

## Drop Other Unnecessary Columns

In [15]:
todrop = ['usablePresidentGender', 'PresidentGender', 'usablePresidentAge', 'HomeRun', 'usableCEOAge','PresidentName','CEOInChargeDuringIPO','CEOTakeOver',
             'PresidentAge','presidentInChargeDuringIPO','PresidentTakeOver','Safe','yearDifferenceGrouped', 'daysProfitGrouped',
                'daysProfit', 'DaysBetterThanSP', 'FoundingDateGrouped' ]

#drop all listed columns in one call; errors = 'ignore' skips any that are not in the table
df_Att = df_Att.drop(columns = todrop, errors = 'ignore')

df_Att.nunique()

Symbol               3699
Profitable              2
FiscalMonth            13
USACompany              3
MarketYearTrend      2144
Market6MonthTrend    2144
Market3MonthTrend    2144
MarketMonthTrend     2144
ipoDate              2239
Summary Quote        3700
Month                  12
MarketCap            3552
Name                 3556
dayOfWeek               5
Day                    31
Year                   23
LastSale             2756
CEOGender               3
CEOName              2894
Industry              132
Sector                 12
YearFounded           154
exactDateFounded     1777
yearDifference        166
CEOAge                 57
employees            1432
employeesGrouped        7
FiscalDateEnd          26
City                  904
stateCountry           97
netIncome            2558
Revenue              2188
dtype: int64

After significantly paring down my attribute table, I reviewed all of the fields that were more than 50% N/A to see if there was anything I wanted to keep. Most are key financial metrics for the company pre-IPO. My hypothesis is that these metrics would have an impact on pricing performance, so I will need to find another data source to pull those metrics in. For purposes of this exercise I'm not going to pull in all 304 columns, but we'll start with the big ones:

• Pre-IPO Revenue
• Pre-IPO EBITDA
• Pre-IPO Cash

I don't have a data source for those yet, so let's clean up the table we have and add a few functions in the next two steps

### Run Additional Conversion Functions to Clean Metadata

In [16]:
#define fiscal quarter based on input date

def FQ(date):
    
    try:
        
        if date.month in range(1,4):
            return '1_' + str(date.year)
        elif date.month in range(4,7):
            return '2_' + str(date.year)
        elif date.month in range(7,10):
            return '3_' + str(date.year)
        elif date.month in range(10,13):
            return '4_'+ str(date.year)
        else:
            return 'NA'
    except:
       
        return "NA"
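
For comparison, the same quarter label can be sketched with the `quarter` attribute that pandas timestamps already carry. The helper name `quarter_label` is hypothetical, and the sample dates are taken from the IPO dates shown in this notebook.

```python
import pandas as pd

def quarter_label(ts):
    """Return a label like '4_1999', or 'NA' for a missing date."""
    if pd.isnull(ts):
        return 'NA'
    return f'{ts.quarter}_{ts.year}'

# Three IPO dates from the data plus one missing value.
dates = pd.Series(pd.to_datetime(['1999-11-18', '2011-01-13', '2005-05-06', None]))
labels = dates.apply(quarter_label)

print(labels.tolist())
```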

In [17]:
#create an attribute table for additional analysis
#pd.options.display.float_format = '{:.5f}'.format
#define US or Other country
def country(x):
    
    if len(str(x).strip()) == 2:
        return 'US'
    else:
        return 'Other'
        
#Revenue and Income columns end with 'B' to denote billions, 'M' to denote millions, or have the straight number if
# less than 1 million. The below function will strip the last character and convert to a float value that consistently
# represents revenue and income in millions. I.e. 1 billion displayed as 1,000; 1 million displayed as 1; 100,000 displayed
#as .1

def conversion(x):
   
    s = str(x).strip() #ensure there are no leading or trailing spaces in string

    suffix = s[-1:] #get last character of the stripped string
    
    if suffix == 'B': #if B define multiplier as 1,000
        mult = 1000
    elif suffix == 'M': #if M define multiplier as 1
        mult = 1
    else:
        mult = 0.000001 #if not B or M multiplier is 1/1000000
 
    
    #loop through string and remove non numbers. Mainly $ signs and commas
    #noticed that some strings also had typos with parentheticals so 
    #the below loop ensures that all non numbers except for decimal points and negative symbols
    #are removed
    
    for letter in s:
         if letter.isdigit() == False and letter != '.' and letter != "-":
            s = s.replace(letter,"")
    
    #handle the nulls
    if pd.isnull(x) == True:
        s = 0
    
    #convert the final string to a float and multiply by the multiplier
    return round(float(s) * mult, 3)

def DOW(x):

    day = ['Mon','Tue','Wed','Thur','Fri','Sat','Sun']
    return day[x.weekday()]
        

df_Att['Country'] = df_Att['stateCountry'].apply(country)
df_Att['Revenue_M'] = df_Att['Revenue'].apply(conversion)
df_Att['Income_M'] = df_Att['netIncome'].apply(conversion)
df_Att.ipoDate = pd.to_datetime(df_Att['ipoDate'])
df_Att['DayofWeek'] = df_Att['ipoDate'].apply(DOW) #Add the name of the day ipo'd
df_Att = df_Att[df_Att.Country == 'US'].copy()# Filter on US; copy so the column assignments below don't act on a slice
df_Att['AgeatIPO'] = df_Att['Year'] - df_Att['YearFounded']

df_Att.sort_values('Revenue_M')#sort by revenue (result is not assigned, so df_Att keeps its original order)
df_Att['FQ'] = df_Att['ipoDate'].apply(FQ)

df_Att.to_excel('IPO Attributes.xls') #export to excel for review/analysis

df_Att


Unnamed: 0,Symbol,Profitable,FiscalMonth,USACompany,MarketYearTrend,Market6MonthTrend,Market3MonthTrend,MarketMonthTrend,ipoDate,Summary Quote,...,City,stateCountry,netIncome,Revenue,Country,Revenue_M,Income_M,DayofWeek,AgeatIPO,FQ
0,A,1,Oct,Yes,2.039844,2.312974,2.352508,1.601165,1999-11-18,https://www.nasdaq.com/symbol/a,...,Santa Clara,CA,$684.00M,$4.47B,US,4470.00,684.00,Thur,0.0,4_1999
1,AAC,1,Dec,Yes,0.881839,0.138536,-1.194498,-2.452645,2014-10-02,https://www.nasdaq.com/symbol/aac,...,Brentwood,TN,$-20.58M,$317.64M,US,317.64,-20.58,Thur,0.0,4_2014
2,AAOI,1,Dec,Yes,1.443672,1.286165,0.926398,0.761732,2013-09-26,https://www.nasdaq.com/symbol/aaoi,...,Sugar Land,TX,$73.95M,$382.33M,US,382.33,73.95,Thur,16.0,3_2013
3,AAP,1,Dec,Yes,-0.745906,-0.128110,1.153716,0.613550,2001-11-29,https://www.nasdaq.com/symbol/aap,...,Roanoke,VA,$475.51M,$9.37B,US,9370.00,475.51,Thur,72.0,4_2001
4,AAT,0,Dec,Yes,2.263666,1.813736,1.824305,1.692499,2011-01-13,https://www.nasdaq.com/symbol/aat,...,San Diego,CA,$29.08M,$311.68M,US,311.68,29.08,Thur,1.0,1_2011
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3754,ZSAN,0,Dec,Yes,1.055378,0.515719,-0.448387,-0.373177,2015-01-27,https://www.nasdaq.com/symbol/zsan,...,Fremont,CA,$-29.11M,,US,0.00,-29.11,Tue,3.0,1_2015
3756,ZTS,1,Dec,Yes,2.456595,2.497749,1.941604,1.830451,2013-02-01,https://www.nasdaq.com/symbol/zts,...,New York,NY,$864.00M,$5.31B,US,5310.00,864.00,Fri,1.0,1_2013
3757,ZUMZ,1,Jan,Yes,0.546793,-0.792605,-0.612410,0.510960,2005-05-06,https://www.nasdaq.com/symbol/zumz,...,Lynnwood,WA,$26.80M,$927.40M,US,927.40,26.80,Fri,27.0,2_2005
3758,ZUO,0,Jan,Yes,0.794603,-0.113986,-0.734112,0.031797,2018-04-12,https://www.nasdaq.com/symbol/zuo,...,Redwood City,CA,$-47.16M,$167.93M,US,167.93,-47.16,Thur,11.0,2_2018
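
To make the unit conversion concrete, here is a small worked example. `to_millions` is a hypothetical, compact restatement of the idea behind `conversion` above (suffix picks the multiplier, everything except digits, decimal points, and minus signs is stripped); it is a sketch for illustration rather than a replacement for the function in the notebook.

```python
import math
import re

def to_millions(value):
    """Convert strings like '$4.47B' or '$-20.58M' to a float in millions."""
    # Treat missing values as zero, as the notebook's function does.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return 0.0
    text = str(value).strip()
    # 'B' means billions, 'M' means millions, anything else is a plain number.
    multiplier = {'B': 1000.0, 'M': 1.0}.get(text[-1:], 1e-6)
    # Keep only digits, decimal points, and minus signs.
    digits = re.sub(r'[^0-9.\-]', '', text)
    return round(float(digits) * multiplier, 3)

for sample in ['$4.47B', '$-20.58M', '100,000', float('nan')]:
    print(sample, '->', to_millions(sample))
```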


In [18]:
to_drop = [ 'Profitable', 'USACompany', 'MarketYearTrend',
       'Market6MonthTrend', 'Market3MonthTrend', 'MarketMonthTrend', 
       'Summary Quote', 'MarketCap',
        'LastSale', 'CEOName',
       'exactDateFounded', 'yearDifference', 
        'employeesGrouped', 'FiscalDateEnd', 'City',
        'Country', 'FiscalMonth', 'Name', 'dayOfWeek','stateCountry', 'netIncome','Revenue', 'YearFounded']

#drop all listed columns in one call; errors = 'ignore' skips any that are not in the table
df_Att = df_Att.drop(columns = to_drop, errors = 'ignore')

    
df_Att

Unnamed: 0,Symbol,ipoDate,Month,Day,Year,CEOGender,Industry,Sector,CEOAge,employees,Revenue_M,Income_M,DayofWeek,AgeatIPO,FQ
0,A,1999-11-18,11,18,1999,male,Biotechnology: Laboratory Analytical Instruments,Capital Goods,56.0,13500,4470.00,684.00,Thur,0.0,4_1999
1,AAC,2014-10-02,10,2,2014,male,Medical Specialities,Health Care,46.0,2100,317.64,-20.58,Thur,0.0,4_2014
2,AAOI,2013-09-26,9,26,2013,unknown,Semiconductors,Technology,54.0,3054,382.33,73.95,Thur,16.0,3_2013
3,AAP,2001-11-29,11,29,2001,male,Other Specialty Stores,Consumer Services,59.0,71000,9370.00,475.51,Thur,72.0,4_2001
4,AAT,2011-01-13,1,13,2011,male,Real Estate Investment Trusts,Consumer Services,79.0,194,311.68,29.08,Thur,1.0,1_2011
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3754,ZSAN,2015-01-27,1,27,2015,male,Major Pharmaceuticals,Health Care,68.0,51,0.00,-29.11,Tue,3.0,1_2015
3756,ZTS,2013-02-01,2,1,2013,male,Major Pharmaceuticals,Health Care,66.0,9200,5310.00,864.00,Fri,1.0,1_2013
3757,ZUMZ,2005-05-06,5,6,2005,male,Clothing/Shoe/Accessory Stores,Consumer Services,57.0,8900,927.40,26.80,Fri,27.0,2_2005
3758,ZUO,2018-04-12,4,12,2018,male,,,,933,167.93,-47.16,Thur,11.0,2_2018


In [19]:
#save attributes to new csv file
df_Att.to_csv('IPO Attributes.csv')

In [20]:
#print min and max ipoDates

print(df_Att.ipoDate.min())
print(df_Att.ipoDate.max())

1996-01-05 00:00:00
2018-04-13 00:00:00


My initial data collection and cleaning phase is complete. I now have a much more manageable set of data that's conducive to analysis: a clean pricing table with one row per company and trading day, and low, high, opening, closing, and volume in the columns.

My attribute table is filtered to US companies, and each stock is described by the 15 columns shown above (pared down from the 50 that survived the N/A filter).



Next Steps:
    
• Import macroeconomic data for all months/years relevant to the IPOs. I'll start with GDP growth, unemployment, and interest rates. I plan to use the pandas DataReader function, which has a direct link to https://fred.stlouisfed.org

• I'll need to do some more digging to see where I can find the pre-IPO Revenue, Cash, and EBITDA metrics I mentioned in the steps above to add to my attribute table 

# Import GDP, Fed Funds Rate, Unemployment, Consumer Sentiment

The pandas DataReader package has a great way to import macroeconomic data directly from the FRED database.
Make sure you have the pandas-datareader package installed on your PC by running the following in the Anaconda prompt:
conda install -c anaconda pandas-datareader

In [21]:
from pandas_datareader.data import DataReader
from datetime import date
start = date(1990,1,1)

#import quarterly GDP and calculate year-over-year growth (change vs. 4 quarters earlier)
GDP = DataReader('GDPC1', 'fred', start )
GDP['growth'] = GDP.GDPC1.pct_change(periods = 4)
GDP = GDP.reset_index()
GDP.columns = ['DATE', 'GDP', 'GDP Growth']
GDP['FQ'] = GDP.DATE.apply(FQ)
GDP.drop(columns = 'DATE', inplace = True)

def FredQ(df, dcolumns, values):
    df['FQ'] = df[dcolumns].apply(FQ)
    df_Q = pd.DataFrame(df.groupby(['FQ'])[values].mean()).reset_index()
    #df_Q.drop(columns = dcolumns, inplace = True)
    return df_Q

#import Fed Funds Interest rate
EFFR = DataReader('FEDFUNDS', 'fred',start).reset_index()
EFFRQ = FredQ(EFFR, 'DATE', 'FEDFUNDS')

#import unemployment rate
UNRATE = DataReader('UNRATE', 'fred',start).reset_index()
UNRATEQ = FredQ(UNRATE, 'DATE', 'UNRATE')

#import Consumer Sentiment Score
CS = DataReader('UMCSENT', 'fred', start).reset_index()
CSQ = FredQ(CS, 'DATE', 'UMCSENT')


#merge dataframes to one
Macro = pd.merge(GDP, EFFRQ, on = 'FQ', how = 'left')
Macro = pd.merge(Macro, UNRATEQ, on = 'FQ', how = 'left')
Macro = pd.merge(Macro, CSQ, on = 'FQ', how = 'left')
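
What `FredQ` does to each monthly series can be checked without a network call. The sketch below uses a made-up monthly series (the column name `RATE` is a hypothetical stand-in for FEDFUNDS, UNRATE, or UMCSENT) and builds the same `quarter_year` label before averaging by quarter.

```python
import pandas as pd

# Made-up monthly series covering two quarters.
monthly = pd.DataFrame({
    'DATE': pd.date_range('2000-01-01', periods=6, freq='MS'),
    'RATE': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Build the same 'quarter_year' label that FQ produces.
monthly['FQ'] = (
    monthly['DATE'].dt.quarter.astype(str) + '_' + monthly['DATE'].dt.year.astype(str)
)

# Average the monthly values within each quarter, keeping chronological order.
quarterly = monthly.groupby('FQ', sort=False)['RATE'].mean().reset_index()

print(quarterly)
```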







# Import Stock Market and Real Estate Data from Quandl

Python has a Quandl package, but I'm going to use the requests package just to practice .get requests and converting JSON responses into dataframes.

In [22]:
import requests
import collections

def Quandl(url, param_dict, col_names):
    
    r = requests.get(url, params = param_dict)
    json_data = r.json()
    df = pd.DataFrame(json_data['dataset']['data'], columns = col_names)
    df.DATE = df.DATE.astype('datetime64[ns]')
    return df
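
The JSON-to-dataframe step inside `Quandl` can be tried offline. The payload below is made up; its shape (a `dataset` object holding `column_names` and `data`) is assumed from how this notebook indexes the response, not taken from the API documentation.

```python
import pandas as pd

# Made-up payload in the shape the notebook expects from the API.
sample_json = {
    'dataset': {
        'column_names': ['Date', 'Value'],
        'data': [['2000-02-01', 28.0], ['2000-01-01', 29.0]],
    }
}

# Rows become records; the caller supplies the column names.
df = pd.DataFrame(sample_json['dataset']['data'], columns=['DATE', 'PE_Ratio'])
df['DATE'] = pd.to_datetime(df['DATE'])

# Sort oldest first so later period-over-period calculations line up.
df = df.sort_values('DATE').reset_index(drop=True)

print(df)
```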

In [23]:
#define quarter based on the first day of the month after quarter end

def FQ_Beg(date):
    
    try:
        
        if date.month ==4 and date.day == 1:
            return '1_' + str(date.year)
       
        elif date.month== 7 and date.day ==1:
            return '2_' + str(date.year)
        
        elif date.month == 10 and date.day == 1:
            return '3_' + str(date.year)
        
        elif date.month ==1 and date.day == 1:
            return '4_'+ str(date.year - 1)
        
        else:
            return 'NA'
    except:
       
        return "NA"


In [24]:
# Import S&P 500 P/E Ratios
API_KEY = 'pTsozhv5F_xzhfyMkVQi'
url = "https://www.quandl.com/api/v3/datasets/MULTPL/SP500_PE_RATIO_MONTH.json"
p = dict(start_date = '1990-01-01' , api_key = API_KEY)

#call my quandl function
PE = Quandl(url, p, ['DATE', 'PE_Ratio'])

PE['FQ'] = PE['DATE'].apply(FQ)
PE_Q = pd.DataFrame(PE.groupby(['FQ'])['PE_Ratio'].mean()).reset_index()


In [25]:
#Import S&P Prices and Get Growth Rate by Quarter

API_KEY = 'pTsozhv5F_xzhfyMkVQi'
url = "https://www.quandl.com/api/v3/datasets/MULTPL/SP500_REAL_PRICE_MONTH.json"
p = dict(start_date = '1990-01-01' , api_key = API_KEY)

#call my quandl function
SP = Quandl(url, p, ['DATE', 'SP_Value'])

#Use the FQ_Beg function to assign quarters to only the quarter ending dates, which I've defined as 
#the first day of the following quarter. The dataset is reporting monthly S&P data as of the first day of the month
#I only want the quarter end dates and I'm going to drop everything else

SP['FQ'] = SP.DATE.apply(FQ_Beg)

#Filter out all dates that aren't a quarter end
SP_Q = SP[SP['FQ'] != 'NA']
SP_Q = SP_Q.sort_values(by = 'DATE', ascending = True)

#Calculate percent change vs. the same quarter one year earlier (4 quarterly periods back)
SP_Q['SP500 Growth'] = SP_Q['SP_Value'].pct_change(periods = 4)
SP_Q.drop(columns = ['DATE'], inplace = True)
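
The effect of the `periods` argument on quarterly data is easy to see on a made-up series: `periods = 1` compares each quarter with the previous one, while `periods = 4` compares it with the same quarter a year earlier and leaves the first four values empty.

```python
import pandas as pd

# Made-up quarterly index values, oldest first.
q = pd.Series(
    [100.0, 102.0, 104.0, 106.0, 110.0],
    index=['4_1999', '1_2000', '2_2000', '3_2000', '4_2000'],
)

qoq = q.pct_change(periods=1)  # change vs. the previous quarter
yoy = q.pct_change(periods=4)  # change vs. the same quarter one year earlier

print(pd.DataFrame({'value': q, 'qoq': qoq, 'yoy': yoy}))
```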



In [26]:
#Write function to pull in all states for Zillow Data. Default API only gave option to pull one state at a time

def GetZillow(indicator, rType):

    API_KEY = 'pTsozhv5F_xzhfyMkVQi'

    #pull list of states from quandl's zillow documentation
    sturl = "https://s3.amazonaws.com/quandl-production-static/zillow/state.txt"
    states = pd.read_csv(sturl, delimiter = '|')
    
    
    Zillow = []
    Errors = []

    #loop through each state and adjust url per the state code
    for st in states['CODE']:

        quandl = 'S' + str(st) + '_' + indicator + '.json'
        url = "https://www.quandl.com/api/v3/datasets/ZILLOW/" + quandl
        params = dict(api_key = API_KEY, start_date = '1996-01-01')
        
        #r = requests.get(url, params = params)
        #json_data = r.json()
       
        #if statecode not found drop to error table
        try:
            df = Quandl(url, params, ['DATE', indicator])
            df['statecode'] = st
            Zillow.append(df)
        except:
            Errors.append(st)

    #Get lists ready for concatenation
    Housing = pd.concat(Zillow)
    MissingStates = pd.DataFrame(Errors, columns = ['CODE']) #works even when no state failed

    #adjust column names    
    Housing.columns = ['Date', indicator, 'CODE']

    #merge the state names onto both df's
    Housing = pd.merge(Housing, states, on ='CODE', how = 'left')
    MissingStates = pd.merge(MissingStates, states, on = 'CODE', how = 'left')

          
    if rType == 'housing':
        return Housing
    elif rType == 'errors':
        return MissingStates
    else:
        print('Wrong rType. Needs to be housing or errors')
        



In [27]:
Foreclosure = GetZillow('HSAFRAL', 'housing')


In [28]:
Foreclosure['FQ'] = Foreclosure['Date'].apply(FQ)
For_Q = pd.DataFrame(Foreclosure.groupby(['FQ'])['HSAFRAL'].mean()).reset_index()


In [29]:
#Pull pricing indexes from FMAC data on quandl

quandl = "HPI_ST_SA.json"
url = "https://www.quandl.com/api/v3/datasets/FMAC/" + quandl
params = dict(api_key = API_KEY)
r = requests.get(url, params = params)
json_data = r.json()

Housing = pd.DataFrame(json_data['dataset']['data']) #convert to df
Housing.columns = json_data['dataset']['column_names'] #add column names
Housing['Date'] = Housing['Date'].astype('datetime64[ns]') #convert date 
Housing_Melt = pd.melt(Housing, id_vars = 'Date', var_name = 'State', value_name = 'HPI') #melt data. states were laid out by column
Housing_Melt['FQ'] = Housing_Melt['Date'].apply(FQ) # apply quarter
Housing_Q = pd.DataFrame(Housing_Melt.groupby(['FQ'])['HPI'].mean()).reset_index() #summarize average by quarter
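
The melt step above turns one column per state into one row per date and state. A minimal sketch on a made-up two-state table (the HPI values are invented for illustration):

```python
import pandas as pd

# Made-up wide table: one column per state.
wide = pd.DataFrame({
    'Date': pd.to_datetime(['2000-01-31', '2000-02-29']),
    'CA': [100.0, 101.0],
    'TX': [90.0, 92.0],
})

# One row per (Date, State) pair, with the index value in 'HPI'.
long = pd.melt(wide, id_vars='Date', var_name='State', value_name='HPI')

print(long)
```

Averaging `HPI` by quarter after this step, as the cell above does, pools all states into a single national figure per quarter.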



In [30]:
#Merge to my other Macro dataframe and export to clean data folder

Macro2 = pd.merge(Macro, PE_Q, on = 'FQ', how = 'left' )
Macro2 = pd.merge(Macro2, SP_Q, on = 'FQ', how = 'left')
Macro2 = pd.merge(Macro2, For_Q, on = 'FQ', how = 'left' )
Macro2 = pd.merge(Macro2, Housing_Q, on = 'FQ', how = 'left' )


#export table to my clean data folder
os.getcwd()
os.chdir('C:\\Users\\nmur1\\Google Drive\\Springboard\\Capstone 1\\CleanData')

Macro2.to_csv('MacroEcon.csv')
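
Because every merge above is a left join on `FQ`, a duplicated quarter label in any table would silently multiply rows. One way to guard against that, sketched here on made-up frames, is the `validate` and `indicator` options of `pd.merge`; the column names `GDP` and `PE` are placeholders.

```python
import pandas as pd

# Made-up quarterly tables keyed on the 'FQ' label.
base = pd.DataFrame({'FQ': ['1_2000', '2_2000', '3_2000'], 'GDP': [1.0, 2.0, 3.0]})
extra = pd.DataFrame({'FQ': ['1_2000', '2_2000'], 'PE': [28.0, 27.0]})

# validate raises MergeError if 'FQ' is duplicated on either side;
# indicator adds a '_merge' column showing which rows found a match.
merged = pd.merge(
    base, extra, on='FQ', how='left', validate='one_to_one', indicator=True
)

print(merged)
```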