# FYI directions, delete later

CRISP-DM Capstone: Association Rule Mining, Clustering, or Collaborative Filtering

In the final assignment for this course, you will be using one of three different analysis methods:

* Option A: Use clustering on an unlabeled dataset to provide insight or features
* Option B: Use transaction data for mining association rules
* Option C: Use collaborative filtering to build a custom recommendation system

Your choice of dataset will largely determine the task that you are trying to achieve, though the dataset does not need to change from your previous tasks. For example, you might choose to use clustering on your data as a preprocessing step that extracts different features. Then you can use those features to build a classifier and analyze its performance in terms of accuracy (precision, recall) and speed. Alternatively, you might choose a completely different dataset and perform rule mining or build a recommendation system.
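
As a rough illustration of the clustering-as-preprocessing idea, the sketch below appends the distance to each k-means centroid as extra features and then scores a classifier on them. The synthetic data from `make_blobs`, the choice of four clusters, the logistic regression classifier, and the helper name `add_cluster_features` are all assumptions made for the example, not requirements of the assignment.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset (assumption: two classes, five numeric features).
X, y = make_blobs(n_samples=600, centers=2, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the clustering on the training split only, so no test information leaks in.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_train)


def add_cluster_features(X, model):
    """Append the distance from each row to every cluster center as extra columns."""
    return np.hstack([X, model.transform(X)])


X_train_aug = add_cluster_features(X_train, kmeans)
X_test_aug = add_cluster_features(X_test, kmeans)

# Train a simple classifier on the original plus cluster-derived features.
clf = LogisticRegression(max_iter=1000).fit(X_train_aug, y_train)
pred = clf.predict(X_test_aug)

metrics = {
    "accuracy": accuracy_score(y_test, pred),
    "precision": precision_score(y_test, pred),
    "recall": recall_score(y_test, pred),
}
print(metrics)
```

Comparing these metrics against the same classifier trained on `X_train` alone is one way to judge whether the cluster features add anything.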

Dataset Selection and Toolkits

As before, you need to choose a dataset that is not small. It might be massive in terms of the number of attributes (or transactions), classes (or items, users, etc.) or whatever is appropriate for the task you are performing. Note that scikit-learn can be used for clustering analysis, but not for association rule mining (you should use R) or collaborative filtering (you should use GraphLab Create from Dato). Both can be run using Jupyter notebooks as shown in lecture.

* One example of a recommendation dataset is the MovieLens rating data: http://grouplens.org/datasets/movielens/
* Some examples of association rule mining datasets: http://fimi.ua.ac.be/data/

Write a report covering in detail all the steps of the project. The results need to be reproducible using only this report. Describe all assumptions you make and include all code you use in the Jupyter notebook or as supplemental functions. Follow the CRISP-DM framework in your analysis (you are performing all of the CRISP-DM outline). This report is worth 20% of the final grade.

# Lab 3: Zillow Dataset CRISP-DM Capstone: Association Rule Mining, Clustering, or Collaborative Filtering
MSDS 7331 Data Mining - Section 403 - Lab 3

Team: Ivelin Angelov, Yao Yao, Kaitlin Kirasich, Albert Asuncion

<a id="top"></a>
## Contents
* <a href="#Imports">Imports</a>
* <a href="#Business">Business Understanding</a>
* <a href="#Description">Dataset Description</a>
* <a href="#Attribute">Attribute Visualization</a>
* <a href="#Train">Train and Adjust Parameters</a>
* <a href="#Evaluate">Evaluate and Compare</a>
* <a href="#Visualize">Visualize Results</a>
* <a href="#Ramifications">Summarize the Ramifications</a>
* <a href="#Deployment">Deployment</a>
* <a href="#Exceptional">Exceptional Work</a>
________________________________________________________________________________________________________

<a id="Imports"></a>

# Imports

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# load datasets here:
train_data = pd.read_csv('../input/train_2016_v2.csv')
data = pd.read_csv('../input/properties_2016.csv', low_memory=False)
data = pd.merge(data, train_data, how='left', on='parcelid')

'The dataset has %d rows and %d columns' % data.shape

<a href="#top">Back to Top</a>
<a id="Business"></a>
# Business Understanding
<b>10 points</b>

<i>
<b>Description:</b><br/>
Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?). How will you measure the effectiveness of a good algorithm? Why does your chosen validation method make sense for this specific dataset and the stakeholders' needs?
</i>

<a href="#top">Back to Top</a>
<a id="Description"></a>
# Dataset Description
<b>10 points</b>

<i>
<b>Description:</b><br/>
Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file. Verify data quality: Are there missing values? Duplicate data? Outliers? Are those mistakes? How do you deal with these problems?
</i>
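
The data quality checks listed above could be sketched roughly as follows. The toy frame `df`, its column names other than `parcelid`, and the 1.5 x IQR outlier rule are illustrative assumptions; the same calls would apply to the merged `data` frame loaded in the imports section.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the merged `data` DataFrame.
df = pd.DataFrame({
    "parcelid": [1, 2, 2, 3, 4, 5],
    "sqft": [1200.0, 1500.0, 1500.0, np.nan, 1800.0, 90000.0],
    "bedrooms": [2, 3, 3, 3, np.nan, 4],
})

# Fraction of missing values per column, largest first.
missing_share = df.isna().mean().sort_values(ascending=False)

# Number of rows that are exact copies of an earlier row.
n_duplicates = int(df.duplicated().sum())


def iqr_outliers(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]; NaN is never flagged."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


outlier_mask = iqr_outliers(df["sqft"])

print(missing_share)
print("duplicate rows:", n_duplicates)
print(df[outlier_mask])
```

Whether a flagged value is a mistake or a genuine extreme still has to be decided case by case; the mask only narrows down which rows to inspect.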

<a href="#top">Back to Top</a>
<a id="Attribute"></a>
# Attribute Visualization
<b>10 points</b>

<i>
<b>Description:</b><br/>
Visualize any important attributes appropriately. Important: Provide an interpretation for any charts or graphs.
</i>

# FYI Modeling and Evaluation, delete later

Different tasks will require different evaluation methods. Be as thorough as possible when analyzing the data you have chosen and use visualizations of the results to explain the performance and expected outcomes whenever possible. Guide the reader through your analysis with plenty of discussion of the results. Each option is broken down by:

* [10 Points] Train and adjust parameters
* [10 Points] Evaluate and Compare
* [10 Points] Visualize Results
* [20 Points] Summarize the Ramifications

Option A: Cluster Analysis
* Train: Perform cluster analysis using several clustering methods (adjust parameters).
* Eval: Use internal and/or external validation measures to describe and compare the clusterings and the clusters. How did you determine a suitable number of clusters for each method?
* Visualize: Use tables/visualization to discuss the found results. Explain each visualization in detail.
* Summarize: Describe your results. What findings are the most interesting and why?
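
For Option A, one common internal validation measure is the silhouette score. The sketch below uses it to pick the number of clusters for k-means; the synthetic data with three hand-placed centers and the candidate range of 2 to 6 clusters are assumptions for illustration. The same loop could be repeated for other clustering methods to compare them.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (assumption for the example).
centers = [(-5, -5), (0, 5), (5, -5)]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.6, random_state=42)

# Score each candidate number of clusters with the mean silhouette coefficient.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest mean silhouette is one reasonable choice.
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)
```

The silhouette score ranges from -1 to 1, and higher values indicate tighter, better-separated clusters.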

Option B: Association Rule Mining
* Train: Create frequent itemsets and association rules (adjust parameters).
* Eval: Use several measures for evaluating how interesting different rules are.
* Visualize: Use tables/visualization to discuss the found results.
* Summarize: Describe your results. What findings are the most compelling and why?
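
To make the Option B measures concrete, here is a plain-Python sketch of support, confidence, and lift on a handful of made-up transactions. It is not the R workflow recommended in the directions; the transactions, the 0.6 minimum support, and the function names are assumptions chosen only to show how the numbers are defined.

```python
from itertools import combinations

# Hypothetical toy transactions; a real analysis would load these from a file.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    supp_both = support(set(antecedent) | set(consequent), transactions)
    supp_a = support(antecedent, transactions)
    supp_c = support(consequent, transactions)
    confidence = supp_both / supp_a
    return {"support": supp_both, "confidence": confidence, "lift": confidence / supp_c}


# Frequent item pairs: brute-force enumeration, fine for a toy example only.
min_support = 0.6
items = sorted(set().union(*transactions))
frequent_pairs = [
    pair for pair in combinations(items, 2)
    if support(pair, transactions) >= min_support
]

m = rule_metrics({"diapers"}, {"beer"}, transactions)
print(frequent_pairs)
print(m)
```

A lift above 1 suggests the antecedent and consequent occur together more often than they would if they were independent. On real data, a dedicated Apriori or FP-growth implementation replaces the brute-force pair enumeration.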

Option C: Collaborative Filtering
* Train: Create user-item matrices or item-item matrices using collaborative filtering (adjust parameters).
* Eval: Determine performance of the recommendations using different performance measures (explain the ramifications of each measure).
* Visualize: Use tables/visualization to discuss the found results. Explain each visualization in detail.
* Summarize: Describe your results. What findings are the most compelling and why?
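
For Option C, the item-item idea can be sketched with NumPy alone. This is not the GraphLab Create workflow mentioned in the directions; the small rating matrix, the use of cosine similarity, and the function names are assumptions for illustration.

```python
import numpy as np

# Hypothetical user-item rating matrix: rows are users, columns are items, 0 = unrated.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)


def item_cosine_similarity(R):
    """Cosine similarity between item columns, treating unrated entries as 0."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for items nobody rated
    return (R.T @ R) / np.outer(norms, norms)


def predict_rating(R, sim, user, item):
    """Similarity-weighted average of the ratings the user has already given."""
    rated = R[user] > 0
    weights = sim[item, rated]
    if weights.sum() == 0:
        return 0.0
    return float(weights @ R[user, rated] / weights.sum())


sim = item_cosine_similarity(ratings)

# Predict how user 0 would rate item 2, which they have not rated yet.
pred = predict_rating(ratings, sim, user=0, item=2)
print(np.round(sim, 3))
print(round(pred, 3))
```

To evaluate such a model, a share of the known ratings would be held out and the predictions compared against them, for example with RMSE or with precision and recall of the top-ranked items.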

<a href="#top">Back to Top</a>
<a id="Train"></a>
# Train and Adjust Parameters
<b>10 points</b>

<i>
<b>Description:</b><br/>
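Depends on the chosen option: perform cluster analysis using several clustering methods (Option A), create frequent itemsets and association rules (Option B), or create user-item or item-item matrices using collaborative filtering (Option C). Adjust parameters in each case.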
</i>

<a href="#top">Back to Top</a>
<a id="Evaluate"></a>
# Evaluate and Compare
<b>10 points</b>

<i>
<b>Description:</b><br/>
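Depends on the chosen option: compare clusterings using internal and/or external validation measures (Option A), evaluate how interesting the rules are using several measures (Option B), or determine the performance of the recommendations using different performance measures (Option C).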
</i>

<a href="#top">Back to Top</a>
<a id="Visualize"></a>
# Visualize Results
<b>10 points</b>

<i>
<b>Description:</b><br/>
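Depends on the chosen option. In every case, use tables and visualizations to discuss the results, and explain each visualization in detail.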
</i>

<a href="#top">Back to Top</a>
<a id="Ramifications"></a>
# Summarize the Ramifications
<b>20 points</b>

<i>
<b>Description:</b><br/>
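Depends on the chosen option. In every case, describe your results and explain which findings are the most interesting or compelling, and why.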
</i>

<a href="#top">Back to Top</a>
<a id="Deployment"></a>
# Deployment
<b>10 points</b>

<i>
<b>Description:</b><br/>
Be critical of your performance and tell the reader how your current model might be usable by other parties. Did you achieve your goals? If not, can you rein in the utility of your modeling?
* How useful is your model for interested parties (i.e., the companies or organizations that might want to use it)?
* How would you deploy your model for interested parties?
* What other data should be collected?
* How often would the model need to be updated, etc.?
</i>

<a href="#top">Back to Top</a>
<a id="Exceptional"></a>
# Exceptional Work
<b>10 points</b>

<i>
<b>Description:</b><br/>
You have free rein to provide additional analyses or combine analyses.
</i>

<a href="#top">Back to Top</a>