# Capital One Data Challenge

__Statement of Work__:  
A real estate company wants to invest in 2-bedroom rental properties in New York.  
Using data from Zillow and AirBnB, find the most profitable zip codes for investment.

----

Zillow - home cost data  
AirBnB - rent revenue data

Normally, Profit = Revenue - Cost

But:
- Home cost is large & one-time. Data is aggregated, but not current.
- Rent revenue is small & recurring. Data is current, but not aggregated.

So:
- estimate current home cost using previous years
- aggregate rent revenue by zipcode and estimate per year
- join datasets and filter New York & 2-bed properties
- sort by break-even years (cost/revenue ratio)
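
The steps above can be sketched on toy data as follows. This is only an illustration under stated assumptions: the frames `cost_history` and `listings`, their column names, every number in them, and the choice of a simple linear extrapolation for cost are all hypothetical and do not reflect the real Zillow or AirBnB files. The 75% occupancy is the same assumption used in the query in section 1.

```python
import pandas as pd

# Hypothetical Zillow-like data: home value per zip code by year.
cost_history = pd.DataFrame({
    "zipcode": ["10001", "10002"],
    "2015": [900_000, 500_000],
    "2016": [950_000, 520_000],
    "2017": [1_000_000, 540_000],
})

# Hypothetical AirBnB-like data: one row per listing with its nightly price.
listings = pd.DataFrame({
    "zipcode": ["10001", "10001", "10002", "10002"],
    "price": [300, 200, 150, 170],
})

OCCUPANCY = 0.75  # assumed share of nights booked

# Step 1: estimate current cost by adding the average yearly change to the last known value.
years = ["2015", "2016", "2017"]
yearly_change = cost_history[years].diff(axis=1).mean(axis=1)
cost_history["est_cost"] = cost_history["2017"] + yearly_change

# Step 2: aggregate nightly price by zip code and scale it to revenue per year.
revenue = (
    (listings.groupby("zipcode")["price"].mean() * 365 * OCCUPANCY)
    .rename("revenue_per_year")
    .reset_index()
)

# Steps 3 and 4: join the two datasets and sort by break-even years.
result = cost_history[["zipcode", "est_cost"]].merge(revenue, on="zipcode")
result["breakeven_years"] = (result["est_cost"] / result["revenue_per_year"]).round(2)
result = result.sort_values("breakeven_years").reset_index(drop=True)
print(result)
```

A lower `breakeven_years` means the purchase cost is recovered sooner, so the first row is the most attractive zip code under these assumptions.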

Assuming ideal data*, the code would look like:
> SELECT  
	a.ZipCode,  
	z.Total_Cost,  
	a.Revenue_Per_Year,  
	z.Total_Cost / a.Revenue_Per_Year AS BreakEven_Years  
FROM	Airbnb a  
JOIN 	Zillow z on a.zipcode = z.zipcode  
WHERE  
	a.Bedrooms = 2 AND z.City = 'New York'  
ORDER BY  
	BreakEven_Years  

----

## 1. KISS - Keep It Short & Simple

#### No data cleaning or estimation - discard bad data and use available data

In [None]:
import pandas, pandasql, qgrid, plotly, plotly_express

# Read Data
zillow = pandas.read_csv('Zip_Zhvi_2bedroom.csv')
airbnb = pandas.read_csv('listings.csv',low_memory=False)

# Clean & Format datatypes
zillow['RegionName'] = zillow['RegionName'].astype(str)
zillow['2017-06'] = zillow['2017-06'].astype(int)
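# Strip '$', ',' and '.00' so that e.g. '$1,200.00' becomes 1200 (regex=True is required in recent pandas)
airbnb['price'] = airbnb['price'].str.replace(r'\$|,|\.00', '', regex=True).astype(int)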
airbnb['last_review'] = pandas.to_datetime(airbnb['last_review'])

# SQL Query
query = """
SELECT
    zipcode,
    neighbourhood_group_cleansed AS Area,

    count(price) AS Properties,  -- no. of properties, to find popular areas
    sum(number_of_reviews) AS Reviews,  -- no. of reviews, to find popular properties
    
    [2017-06] AS [Cost$],  -- use 2017 cost
    (cast(AVG(price) * 365 * .75 as int))/100*100 AS [Revenue$/Year],  -- use daily price and 75% occupancy
    
    round([2017-06]/(AVG(price) * 365 * .75),2) AS BreakEven_Years
FROM
    airbnb a
JOIN
    zillow z ON a.zipcode = z.RegionName
WHERE 1=1
    AND a.bedrooms = 2 AND z.city = 'New York'  -- filter client requirements

    AND a.number_of_reviews > 2 AND a.last_review > date('now','-2 years')  -- filter popular properties
GROUP BY
    zipcode
HAVING
    Properties > 2  -- filter popular areas
ORDER BY
    BreakEven_Years
"""

az = pandasql.sqldf(query)

# Interactive Grid
display(qgrid.show_grid(az)) 


plotly.offline.init_notebook_mode()
# ROI Quadrant
fig = plotly_express.scatter(az, x="Revenue$/Year", y="BreakEven_Years", color="Properties", size="Cost$",
                             hover_data=['Cost$','Area','zipcode'],
                             title='ROI Quadrant : Bottom-Right - High Revenue & Quick BreakEven')
fig.show()

### TL;DR - Too Long; Didn't Read

__Popular market__ : Invest `$`2 million in a Manhattan __10022__ property. Earn `$`125K/year in rent. Break Even in 15 years.  
    OR  
__Niche market__ : Invest `$`400K in Queens __11434__. Earn `$`40K/year. Break Even in 12 years.  

#### WARNING: Invest at your own risk. Data subject to change. Filter conditions are subjective.
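
The 75% occupancy used in the query is one of those subjective conditions. The sketch below shows how much the break-even estimate moves when that assumption changes. The function name `break_even_years`, the purchase cost, and the nightly price are hypothetical values chosen for illustration and are not taken from the data.

```python
def break_even_years(cost, nightly_price, occupancy=0.75):
    """Years of rent needed to repay the purchase cost.

    Ignores running expenses, taxes and changes in home value.
    """
    revenue_per_year = nightly_price * 365 * occupancy
    return round(cost / revenue_per_year, 2)


# Hypothetical property: $400,000 purchase cost, $150 average nightly price.
cost, nightly_price = 400_000, 150

# Break-even under pessimistic, baseline and optimistic occupancy assumptions.
scenarios = {occ: break_even_years(cost, nightly_price, occ) for occ in (0.50, 0.75, 0.90)}

for occ, years in scenarios.items():
    print(f"occupancy {occ:.0%}: break-even in {years} years")
```

With these illustrative inputs, the break-even period ranges from about 8 years at 90% occupancy to almost 15 years at 50%, so a ranking of zip codes should be read together with the occupancy assumption behind it.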

In [None]:
# Part 2 .. continued in cap1_irl