# Project: Milestone 4

## Connecting to an API/Pulling in the Data and Cleaning/Formatting

***Instructions)***

Perform at least 5 data transformation and/or cleansing steps to your API data. The below examples are not required - they are just potential transformations you could do. If your data doesn't work for these scenarios, complete different transformations. You can do the same transformation multiple times if you need to clean your data. The goal is a clean dataset at the end of the milestone.

- Replace Headers
- Format data into a more readable format
- Identify outliers and bad data
- Find duplicates
- Fix casing or inconsistent values
- Conduct Fuzzy Matching

Make sure you clearly label each transformation (Step #1, Step #2, etc.) in your code and describe what it is doing in 1-2 sentences. You can submit a Jupyter Notebook or a PDF of your code. If you submit a .py file you need to also include a PDF or attachment of your results.


***Answer)***

**#0. Intro Work**

First, I write my API key and secret to a JSON file so they can be loaded when calling the API.

In [2]:
import json
import pprint

# defining the data (including the API Key and the Secret Key)
# I removed the actual values for security purposes, but the values are now housed in the JSON file
data = {
    "prod_api_key": "---",
    "prod_secret": "---"  # additional authentication layer
}

# specifying the JSON file name
filename = 'hc_prod_api_key.json'

# writing the data to a JSON file
with open(filename, 'w') as file:
    json.dump(data, file, indent=4)




In [3]:
# This JSON file has my API keys
with open('hc_prod_api_key.json') as f:
    keys = json.load(f)
    prod_key = keys['prod_api_key']
    prod_secret = keys['prod_secret']
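
Because the key file above stores the credentials in plain text, one alternative is to read them from environment variables instead. The sketch below is only an illustration: the variable names `HC_PROD_API_KEY` and `HC_PROD_SECRET` and the helper `load_credentials` are hypothetical and are not part of the original notebook or of the API itself.

```python
import os


def load_credentials(env=None):
    """Return (key, secret) read from environment variables (names are assumed)."""
    env = os.environ if env is None else env
    key = env.get("HC_PROD_API_KEY")
    secret = env.get("HC_PROD_SECRET")
    if not key or not secret:
        raise RuntimeError("Set HC_PROD_API_KEY and HC_PROD_SECRET before running.")
    return key, secret


# A stand-in mapping is passed here so the sketch runs without real credentials
demo_key, demo_secret = load_credentials(
    {"HC_PROD_API_KEY": "---", "HC_PROD_SECRET": "---"}
)
```

Calling `load_credentials()` with no argument would read from the real environment, so the secret never has to be written to disk.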

It is now time to call our API.

This API is not free; each unit requested carries a significant cost. I will request only 5 units, since that is all the analysis needs, which keeps costs to a minimum.

In [7]:
# creating a list of MSAs that will be used for the API call
msa_list = ['35620', '31080', '16980', '19100', '26420']

# list comprehension to create the list of dictionaries
data = [{"msa": msa_code} for msa_code in msa_list]

In [8]:
import requests

# destination for the api call -> MSA Batch Gross Rental Yield
url = 'https://api.housecanary.com/v2/msa/hcri'

# data to be sent in JSON format was done in the previous cell

# making a POST request with JSON payload
response = requests.post(url, json=data, auth=(prod_key, prod_secret))

print(response.url)
pprint.pprint(response.json())

https://api.housecanary.com/v2/msa/hcri
[{'msa/hcri': {'api_code': 0,
               'api_code_description': 'ok',
               'result': {'gross_yield_average': 0.0661,
                          'gross_yield_count': 5047844,
                          'gross_yield_median': 0.0643}},
  'msa_info': {'msa': '35620',
               'msa_name': 'New York-Newark-Jersey City, NY-NJ-PA'}},
 {'msa/hcri': {'api_code': 0,
               'api_code_description': 'ok',
               'result': {'gross_yield_average': 0.0523,
                          'gross_yield_count': 3005346,
                          'gross_yield_median': 0.0506}},
  'msa_info': {'msa': '31080',
               'msa_name': 'Los Angeles-Long Beach-Anaheim, CA'}},
 {'msa/hcri': {'api_code': 0,
               'api_code_description': 'ok',
               'result': {'gross_yield_average': 0.0996,
                          'gross_yield_count': 3121132,
                          'gross_yield_median': 0.0956}},
  'msa_info': {'msa': '

In [26]:
# save response to json file for future reference
response_data = response.json()

filename = 'api_response.json'

# write the data to a JSON file
with open(filename, 'w') as file:
    json.dump(response_data, file, indent=4)

print(f"API response saved to {filename}")

API response saved to api_response.json
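
Before flattening, it can help to confirm that every item in the response reports success. The sketch below assumes the response shape shown in the printed output above and that an `api_code` of 0 means success (as the 'ok' description suggests). `find_failed_items` is a hypothetical helper, and the sample data is the first two items from that output.

```python
def find_failed_items(items, endpoint="msa/hcri"):
    """Return the MSA codes whose api_code is not 0 (0 is assumed to mean 'ok')."""
    failed = []
    for item in items:
        if item[endpoint]["api_code"] != 0:
            failed.append(item["msa_info"]["msa"])
    return failed


# First two items from the printed response above
sample = [
    {"msa/hcri": {"api_code": 0,
                  "api_code_description": "ok",
                  "result": {"gross_yield_average": 0.0661,
                             "gross_yield_count": 5047844,
                             "gross_yield_median": 0.0643}},
     "msa_info": {"msa": "35620",
                  "msa_name": "New York-Newark-Jersey City, NY-NJ-PA"}},
    {"msa/hcri": {"api_code": 0,
                  "api_code_description": "ok",
                  "result": {"gross_yield_average": 0.0523,
                             "gross_yield_count": 3005346,
                             "gross_yield_median": 0.0506}},
     "msa_info": {"msa": "31080",
                  "msa_name": "Los Angeles-Long Beach-Anaheim, CA"}},
]

failed = find_failed_items(sample)
print(failed)  # an empty list means every item succeeded
```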


**#1. Flatten the JSON to prep it for a data frame**

Our first transformation is to flatten the JSON response from the API and load it into a data frame that we can work with.

In [17]:
import pandas as pd

# api response variable
response_json = response.json()

# flatten the nested structure and prepare the data for the data frame
data_for_frame = []
for item in response_json:
    flattened_item = {
        'msa': item['msa_info']['msa'],
        'msa_name': item['msa_info']['msa_name'],
        'api_code': item['msa/hcri']['api_code'],
        'api_code_description': item['msa/hcri']['api_code_description'],
        'gross_yield_average': item['msa/hcri']['result']['gross_yield_average'],
        'gross_yield_count': item['msa/hcri']['result']['gross_yield_count'],
        'gross_yield_median': item['msa/hcri']['result']['gross_yield_median']
    }
    data_for_frame.append(flattened_item)
    
# create df
api_df = pd.DataFrame(data_for_frame)

api_df.head()

|   | msa | msa_name | api_code | api_code_description | gross_yield_average | gross_yield_count | gross_yield_median |
|---|---|---|---|---|---|---|---|
| 0 | 35620 | New York-Newark-Jersey City, NY-NJ-PA | 0 | ok | 0.0661 | 5047844 | 0.0643 |
| 1 | 31080 | Los Angeles-Long Beach-Anaheim, CA | 0 | ok | 0.0523 | 3005346 | 0.0506 |
| 2 | 16980 | Chicago-Naperville-Elgin, IL-IN-WI | 0 | ok | 0.0996 | 3121132 | 0.0956 |
| 3 | 19100 | Dallas-Fort Worth-Arlington, TX | 0 | ok | 0.0828 | 2273489 | 0.0814 |
| 4 | 26420 | Houston-The Woodlands-Sugar Land, TX | 0 | ok | 0.0904 | 2253476 | 0.0893 |
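
The manual loop above spells out every key path. As an alternative sketch, `pd.json_normalize` can expand the nested dictionaries automatically; the column names are then shortened to their last dotted segment. This assumes those last segments are unique, which holds for the response shape shown earlier. The sample is the first two items from the printed response.

```python
import pandas as pd

sample = [
    {"msa/hcri": {"api_code": 0,
                  "api_code_description": "ok",
                  "result": {"gross_yield_average": 0.0661,
                             "gross_yield_count": 5047844,
                             "gross_yield_median": 0.0643}},
     "msa_info": {"msa": "35620",
                  "msa_name": "New York-Newark-Jersey City, NY-NJ-PA"}},
    {"msa/hcri": {"api_code": 0,
                  "api_code_description": "ok",
                  "result": {"gross_yield_average": 0.0523,
                             "gross_yield_count": 3005346,
                             "gross_yield_median": 0.0506}},
     "msa_info": {"msa": "31080",
                  "msa_name": "Los Angeles-Long Beach-Anaheim, CA"}},
]

# json_normalize turns nested dicts into dotted column names,
# e.g. 'msa/hcri.result.gross_yield_average'
flat_df = pd.json_normalize(sample)

# keep only the last segment of each dotted name to match the manual version
flat_df.columns = [col.split(".")[-1] for col in flat_df.columns]

print(flat_df.columns.tolist())
```

The explicit loop is still the safer choice if the response shape may change, since a missing key fails loudly there instead of silently producing a different set of columns.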


**#2. Dropping irrelevant columns**

Every row returned 'api_code' 0 with the description 'ok', so these two columns carry no information for the analysis and can be dropped.

In [18]:
api_dfV2 = api_df.drop(columns=['api_code', 'api_code_description'])

api_dfV2.head()

|   | msa | msa_name | gross_yield_average | gross_yield_count | gross_yield_median |
|---|---|---|---|---|---|
| 0 | 35620 | New York-Newark-Jersey City, NY-NJ-PA | 0.0661 | 5047844 | 0.0643 |
| 1 | 31080 | Los Angeles-Long Beach-Anaheim, CA | 0.0523 | 3005346 | 0.0506 |
| 2 | 16980 | Chicago-Naperville-Elgin, IL-IN-WI | 0.0996 | 3121132 | 0.0956 |
| 3 | 19100 | Dallas-Fort Worth-Arlington, TX | 0.0828 | 2273489 | 0.0814 |
| 4 | 26420 | Houston-The Woodlands-Sugar Land, TX | 0.0904 | 2253476 | 0.0893 |


**#3. Rename column for consistency across data sources**

In [20]:
# renaming the 'msa' column to 'cbsa_msa_id' to be consistent with the other data sources
api_dfV3 = api_dfV2.rename(columns={'msa': 'cbsa_msa_id'})

api_dfV3.head()

|   | cbsa_msa_id | msa_name | gross_yield_average | gross_yield_count | gross_yield_median |
|---|---|---|---|---|---|
| 0 | 35620 | New York-Newark-Jersey City, NY-NJ-PA | 0.0661 | 5047844 | 0.0643 |
| 1 | 31080 | Los Angeles-Long Beach-Anaheim, CA | 0.0523 | 3005346 | 0.0506 |
| 2 | 16980 | Chicago-Naperville-Elgin, IL-IN-WI | 0.0996 | 3121132 | 0.0956 |
| 3 | 19100 | Dallas-Fort Worth-Arlington, TX | 0.0828 | 2273489 | 0.0814 |
| 4 | 26420 | Houston-The Woodlands-Sugar Land, TX | 0.0904 | 2253476 | 0.0893 |


**#4. Converting 'gross_yield_average' from decimal value to percentage value**

In [22]:
# creating a new df copy for consistency
api_dfV4 = api_dfV3.copy()

# Converting 'gross_yield_average' from decimal to percentage
api_dfV4['gross_yield_average'] = api_dfV4['gross_yield_average'] * 100
api_dfV4 = api_dfV4.rename(columns={'gross_yield_average': 'prcnt_gross_yield_average'})

api_dfV4.head()

|   | cbsa_msa_id | msa_name | prcnt_gross_yield_average | gross_yield_count | gross_yield_median |
|---|---|---|---|---|---|
| 0 | 35620 | New York-Newark-Jersey City, NY-NJ-PA | 6.61 | 5047844 | 0.0643 |
| 1 | 31080 | Los Angeles-Long Beach-Anaheim, CA | 5.23 | 3005346 | 0.0506 |
| 2 | 16980 | Chicago-Naperville-Elgin, IL-IN-WI | 9.96 | 3121132 | 0.0956 |
| 3 | 19100 | Dallas-Fort Worth-Arlington, TX | 8.28 | 2273489 | 0.0814 |
| 4 | 26420 | Houston-The Woodlands-Sugar Land, TX | 9.04 | 2253476 | 0.0893 |


**#5. Converting 'gross_yield_median' from decimal value to percentage value**

In [23]:
# creating a new df copy for consistency
api_dfV5 = api_dfV4.copy()

# Converting 'gross_yield_median' from decimal to percentage
api_dfV5['gross_yield_median'] = api_dfV5['gross_yield_median'] * 100
api_dfV5 = api_dfV5.rename(columns={'gross_yield_median': 'prcnt_gross_yield_median'})

api_dfV5.head()

|   | cbsa_msa_id | msa_name | prcnt_gross_yield_average | gross_yield_count | prcnt_gross_yield_median |
|---|---|---|---|---|---|
| 0 | 35620 | New York-Newark-Jersey City, NY-NJ-PA | 6.61 | 5047844 | 6.43 |
| 1 | 31080 | Los Angeles-Long Beach-Anaheim, CA | 5.23 | 3005346 | 5.06 |
| 2 | 16980 | Chicago-Naperville-Elgin, IL-IN-WI | 9.96 | 3121132 | 9.56 |
| 3 | 19100 | Dallas-Fort Worth-Arlington, TX | 8.28 | 2273489 | 8.14 |
| 4 | 26420 | Houston-The Woodlands-Sugar Land, TX | 9.04 | 2253476 | 8.93 |
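
Steps 4 and 5 apply the same arithmetic to two columns, so they could also be handled in one pass. The sketch below is one possible way to do that: `to_percent` is a hypothetical helper, and rounding to 2 decimals is an assumption based on the API values having 4 decimal places. The rounding is there because multiplying a float by 100 can leave a tiny floating-point residue that the notebook display hides but a saved CSV may not.

```python
import pandas as pd

# Two rows taken from the table above, in their original decimal form
df = pd.DataFrame({
    "cbsa_msa_id": ["35620", "31080"],
    "gross_yield_average": [0.0661, 0.0523],
    "gross_yield_median": [0.0643, 0.0506],
})


def to_percent(frame, columns, decimals=2):
    """Return a copy with each column multiplied by 100, rounded, and renamed prcnt_<col>."""
    out = frame.copy()
    for col in columns:
        out[col] = (out[col] * 100).round(decimals)
    return out.rename(columns={col: f"prcnt_{col}" for col in columns})


pct_df = to_percent(df, ["gross_yield_average", "gross_yield_median"])
print(pct_df)
```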


The data is now ready to be joined alongside the flat file and web data.

In [24]:
# creating a csv of the final version of the API data
api_dfV5.to_csv("api_data_final_version.csv", index=False)
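
When this CSV is read back for the join, pandas will by default infer 'cbsa_msa_id' as an integer column, which can cause a key mismatch if the other sources store the ID as text. A minimal sketch, assuming the other sources do hold it as a string, uses the `dtype` argument of `pd.read_csv`. The in-memory CSV below stands in for `api_data_final_version.csv` and contains two of the rows from the final table.

```python
import io

import pandas as pd

csv_text = (
    "cbsa_msa_id,msa_name,prcnt_gross_yield_average,gross_yield_count,prcnt_gross_yield_median\n"
    '35620,"New York-Newark-Jersey City, NY-NJ-PA",6.61,5047844,6.43\n'
    '31080,"Los Angeles-Long Beach-Anaheim, CA",5.23,3005346,5.06\n'
)

# default behaviour: the ID column is inferred as an integer type
inferred = pd.read_csv(io.StringIO(csv_text))

# forcing the ID column to text keeps it comparable with string keys elsewhere
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"cbsa_msa_id": str})

print(inferred["cbsa_msa_id"].dtype)
print(as_text["cbsa_msa_id"].tolist())
```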