# Milestone 1 Assignment - Capstone Proposal

## Author - Mike Pearson

### Capstone Project Instructions
Select a problem and data sets of particular interest and apply the analytics process to find and report on a solution.

Students will construct a simple dashboard to allow a non-technical user to explore their solution. The data should be read from suitable persistent data storage, such as an Internet URL or a SQL database.

The process followed by the students and the grading criteria include:
<ol style="list-style-type: lower-alpha;">
<li>Understand the business problem <span class="label" style="border-radius: 3px; background-color: darkcyan; color: white;">Milestone 1</span></li>
<li>Evaluate and explore the available data <span class="label" style="border-radius: 3px; background-color: darkcyan; color: white;">Milestone 1</span></li>
<li>Proper data preparation <span class="label" style="border-radius: 3px; background-color: darkcyan; color: white;">Milestone 1</span> <span class="label" style="border-radius: 3px; background-color: royalblue; color: white;">Milestone 2</span></li>
<li>Explore the data and understand relationships <span class="label" style="border-radius: 3px; background-color: darkcyan; color: white;">Milestone 1</span> <span class="label" style="border-radius: 3px; background-color: royalblue; color: white;">Milestone 2</span></li>
<li>Perform basic analytics and machine learning, within the scope of the course, on the data.  <span class="label" style="border-radius: 3px; background-color: royalblue; color: white;">Milestone 2</span> <span class="label" style="border-radius: 3px; background-color: slateblue; color: white;">Milestone 3</span> <BR/>For example, classification to predict which employees are most likely to leave the company.</li>
<li>Create a written and/or oral report on the results suitable for a non-technical audience. <span class="label" style="border-radius: 3px; background-color: slateblue; color: white;">Milestone 3</span></li>
</ol>




## Tasks
For this proposal, you are to:
1. Generate or describe a solvable business problem and outline the flow of data needed to address the problem.
2. Identify 2 or more available data sets.
3. Report on the statistics of each data set, including type, unique values, missing values, quantile statistics, descriptive statistics, most frequent values, and histogram. Include analysis statements based on the results.
4. Perform data preparation based on analysis of the quality of the available data, including the concatenation method, imputation method(s), dealing with outliers, and binning/scaling transformation.
5. Output the resulting data into a new data file.
6. Identify potential machine learning model(s).


## Problem Definition



For this project, I am playing the role of a data scientist at Zillow who wants to build a better prediction model for house prices. Zillow has started a business group that essentially flips houses: it purchases them from sellers who want a quick, guaranteed price. It is therefore important that Zillow has a very good idea of what a house might sell for in an open, time-unconstrained market.

With this in mind, I think the selling price of homes might be related to the stock prices of large local companies. If Amazon's stock (for example) is riding high, home prices may be driven up because buyers have more resources to purchase. The overall stock market may also indicate, in general, a growing economy.

So I will use the King County real estate data set from the DataSci410: Methods for Data Science final milestone project, plus the stock prices of the largest publicly traded local companies in terms of employees and local impact (Amazon, Boeing, Microsoft, Starbucks, Walmart), along with the Dow Jones Industrial Average.

I used a TripSavvy article (https://www.tripsavvy.com/biggest-seattle-area-employers-2965051) to determine who the largest local employers are. Some of them (UW, Joint Base Lewis-McChord, Providence Health) aren't traded on the market.
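The data flow described above (house sales joined to stock prices by date) could be sketched as follows. This is only an illustration under stated assumptions: the sale prices are made up, the helper name `attach_last_close` is hypothetical, and both tables are assumed to already have a parsed `datetime` column. Because markets are closed on weekends and holidays, the sketch uses `pd.merge_asof` to pick the most recent close on or before each sale date rather than requiring an exact date match.

```python
import pandas as pd

# Toy stand-ins for the real tables. Sale prices are made up for illustration;
# the closing prices are rounded from the first Amazon rows.
sales = pd.DataFrame({
    "datetime": pd.to_datetime(["2014-05-03", "2014-05-05", "2014-05-06"]),
    "price": [450000.0, 610000.0, 380000.0],
})
amzn = pd.DataFrame({
    "datetime": pd.to_datetime(["2014-05-02", "2014-05-05", "2014-05-06"]),
    "Close": [308.01, 310.05, 297.38],
})

def attach_last_close(sales, stock, name):
    """Attach the most recent closing price on or before each sale date."""
    closes = stock[["datetime", "Close"]].rename(columns={"Close": name + "_close"})
    # merge_asof needs both sides sorted by the key column
    return pd.merge_asof(
        sales.sort_values("datetime"),
        closes.sort_values("datetime"),
        on="datetime",
        direction="backward",
    )

merged = attach_last_close(sales, amzn, "AMZN")
print(merged)
```

The Saturday sale (2014-05-03) picks up Friday's close. Repeating the call for each company and for the Dow Jones would add one closing-price column per ticker.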

## Data Sets

I downloaded the stock and Dow Jones Industrial Average data from Yahoo Finance and put the .csv files in my GitHub repository.


In [8]:
# Datasets location
fileName = "https://library.startlearninglabs.uw.edu/DATASCI410/Datasets/kc_house_data.csv"

## The Dow Jones Industrial closing value for the period May 2, 2014 to May 27, 2015
URL2 = 'https://raw.githubusercontent.com/mutecypher/DataSci-420/master/DJI.csv'

## Amazon's closing price for the period May 2, 2014 to May 27, 2015
URL3 = 'https://raw.githubusercontent.com/mutecypher/DataSci-420/master/AMZN.csv'

## Boeing's closing price for the period May 2, 2014 to May 27, 2015
URL4 = 'https://raw.githubusercontent.com/mutecypher/DataSci-420/master/BA.csv'

## Microsoft's closing price for the period May 2, 2014 to May 27, 2015

URL5 = 'https://raw.githubusercontent.com/mutecypher/DataSci-420/master/MSFT.csv'


## Starbucks' closing price for the period May 2, 2014 to May 27, 2015

URL6 = 'https://raw.githubusercontent.com/mutecypher/DataSci-420/master/SBUX.csv'

## Walmart's closing price for the period May 2, 2014 to May 27, 2015

URL7 = 'https://raw.githubusercontent.com/mutecypher/DataSci-420/master/WMT.csv'



## Profile Reports & Analysis Statements

In [9]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import datetime as dt
import seaborn as sbn

In [10]:
## Get the housing data
Trippy = pd.read_csv(fileName)
Housing = pd.DataFrame(Trippy)

## Get the Amazon data

AMZN = pd.read_csv(URL3)

## Get the Boeing data - BA is their stock symbol

BA = pd.read_csv(URL4)

## Get the Microsoft data

MSFT = pd.read_csv(URL5)

## Get the Starbucks data

SBUX = pd.read_csv(URL6)

## Get the Walmart data

WMT= pd.read_csv(URL7)

## Get the Dow Jones Industrial average data

DJI = pd.read_csv(URL2)
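The statistics requested for each data set (type, unique values, missing values, quantiles, descriptive statistics, most frequent value) could be gathered with a small helper like the one below. This is a minimal sketch rather than the report itself: the function name `profile` and the toy frame are assumptions, and the real call would pass `Housing`, `AMZN`, and the other frames loaded above.

```python
import pandas as pd

def profile(frame):
    """Return one row of summary statistics per column of frame."""
    rows = {}
    for column in frame.columns:
        series = frame[column]
        modes = series.mode()
        row = {
            "dtype": str(series.dtype),
            "unique": series.nunique(),            # NaN is not counted
            "missing": int(series.isna().sum()),
            "most_frequent": modes.iloc[0] if not modes.empty else None,
        }
        # Quantile and descriptive statistics only make sense for numeric columns
        if pd.api.types.is_numeric_dtype(series):
            row["mean"] = series.mean()
            row["std"] = series.std()
            for q in (0.25, 0.5, 0.75):
                row["q%d" % int(q * 100)] = series.quantile(q)
        rows[column] = row
    return pd.DataFrame.from_dict(rows, orient="index")

# Toy frame with made-up values, standing in for one of the real data sets
toy = pd.DataFrame({
    "bedrooms": [3, 3, 4, None],
    "zipcode": ["98101", "98052", "98101", "98004"],
})
report = profile(toy)
print(report)
```

Histograms are not covered by this helper; `DataFrame.hist()` with the matplotlib import above is one way to draw them for the numeric columns.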


## Data Preparation



In [11]:
## I find it easier to work with the data sets if each one has datetime, year, month, and day columns
## This code adds them to each data set

def add_date_columns(frame, date_text, date_format=None):
    ## Parse every row's own date string, then derive the calendar parts from the result
    parsed = pd.to_datetime(date_text, format=date_format)
    frame['year'] = parsed.dt.year
    frame['month'] = parsed.dt.month
    frame['day'] = parsed.dt.day
    frame['datetime'] = parsed

## Housing - the first eight characters of 'date' are YYYYMMDD
add_date_columns(Housing, Housing['date'].astype(str).str[0:8], "%Y%m%d")
Housing.sort_values(['year','month', 'day'], ascending = (True, True, True), inplace = True)

## Amazon, Boeing, Microsoft, Starbucks, Walmart, and the Dow Jones Industrial Average
## The 'Date' column holds strings such as 5/2/14, which pandas reads as month/day/year
for stock in (AMZN, BA, MSFT, SBUX, WMT, DJI):
    add_date_columns(stock, stock['Date'].astype(str))

print("\nAmazon head is ", AMZN.head())


Amazon head is       Date        Open        High         Low       Close   Adj Close  \
0  5/2/14  310.420013  313.290009  304.309998  308.010010  308.010010   
1  5/5/14  306.369995  310.230011  305.000000  310.049988  310.049988   
2  5/6/14  309.529999  309.809998  297.040009  297.380005  297.380005   
3  5/7/14  295.559998  296.399994  286.679993  292.709991  292.709991   
4  5/8/14  290.820007  295.880005  287.230011  288.320007  288.320007   

    Volume  year  month  day   datetime  
0  3995100  2014      5    2 2014-05-02  
1  2519900  2014      5    5 2014-05-05  
2  4682300  2014      5    6 2014-05-06  
3  7015200  2014      5    7 2014-05-07  
4  3848200  2014      5    8 2014-05-08  
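The proposal tasks also call for dealing with outliers and a scaling transformation. One possible approach, shown here only as a sketch and not as the method this project settles on, is to clip values that fall outside the usual 1.5 × IQR fences and then standardize to z-scores. The function names and the sample prices are assumptions made up for the example.

```python
import pandas as pd

def clip_iqr(series, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

def zscore(series):
    """Scale to zero mean and unit (sample) standard deviation."""
    return (series - series.mean()) / series.std()

# Made-up sale prices with one extreme value
prices = pd.Series([300000.0, 350000.0, 400000.0, 450000.0, 5000000.0])
clipped = clip_iqr(prices)   # the 5,000,000 sale is pulled down to the upper fence
scaled = zscore(clipped)
print(clipped.tolist())
print(scaled.round(3).tolist())
```

Scaling the stock closing prices the same way would put a roughly $300 Amazon share and a lower-priced share on a comparable footing before modeling.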


## Code Output

## Machine Learning Model(s)