# ACCY 571 Group Project  

## Overview 
-----

### Goal   

Complete a data analytics project that demonstrates your mastery of the course content.
  
  1.  Demonstrate the ability to apply machine learning and data analytics concepts from ACCY 570 to summarize data and to produce insightful visualizations.
  2.  Show that you can interact with a SQL database to extract relevant data.
  3.  Show that you can use text and network analyses to extract key insights from a rich data set.
  
### Prompt  

Your boss at Goliath National Bank sends you the frustratingly vague directive:

> We're interested in funding the development of new businesses such as 
> - restaurants,  
> - coffee shops, and  
> - bars
>  
> in the far-off land of ____, but we need your team to 
> **get a better picture of the consumers** before we make any investments.  

Your task is to pick a region or city and use the provided Yelp database to characterize the eating and spending habits of the resident consumers. Your boss hasn't specified exactly what she expects you to find, so it is up to you to figure out how best to determine the best location.

-----

  
## Criteria
-----

You will work in groups of __4-5__ students to analyze the provided **Yelp** data set to make a recommendation,
based on the features present in this database.

You will complete three tasks for this group project:
1. A group report in the form of a Jupyter notebook,
2. An in-class presentation where your group will present your results, and 
3. Peer evaluation of the contributions of each member of your group.

Your final group report will be a single Jupyter notebook that integrates Markdown, Python code, and the results from your code, such as data visualizations. Use Markdown cells to explain any decisions you make regarding the data, to discuss any plots or visualizations generated in your notebook, and to present the results of your analysis. As a general guideline, write the content so that a fellow classmate (or an arbitrary data scientist/analyst) can read your report and understand the results, their implications, and the process you followed to reach them. If printed (not that you should do this), your report should be at least fifteen pages.

Your group will present the material in class in a format that is left up to each group. For example, you can use presentation software such as MS PowerPoint, PDFs, your Notebook, or Prezi, or you can choose some other presentation style (feel free to discuss your ideas with the course staff). The presentation should cover all steps in your analytics process and highlight your results. It should take between eight and twelve minutes, and will be graded by your discussion teaching assistant.

### Rubric
  - Notebook Report (40%)
  - Class presentation (40%)
  - Peer assessment from your group-mates (20%)

### General

Your report should 
  1. use proper markdown, 
  2. include all of the code used for your analysis,
  3. include properly labeled plots (e.g., use axis labels and titles),
  4. use a consistent style between graphs, and
  5. be entirely the work of your own group. **Do not plagiarize code; this includes anything you might find online**.
  
All code should be written by you and your group.

-----

### Exploratory Data Analysis (EDA)

When exploring the database to determine how to pick the best location (and optionally what type of business or businesses to launch), some ideas to consider are:

- What types of restaurants are most popular and where?  
  - Can we predict the rating of different types of restaurants? 
  - How does your city compare to other major cities or nearby towns? 
  - E.g., do coffee shops fare better in Champaign than elsewhere?
- What can be learned from the review text itself? 
  - What's the sentiment towards different types of eateries? 
  - What are reviewers talking about the most?
- Do users who visit one business tend to visit certain other businesses? 
  - E.g., do people who like _Seven Saints_ tend to like _Destihl_? What about their friends?
- Is the restaurant selection diverse or homogeneous? 
  - Are there a lot of a few types of restaurants? 
  - Has this been changing over time? 
  - Is there a demand for more types of food?
- How much does location matter? 
  - Are there central hubs where restaurants tend to do well in the reviews?

These questions are __NOT__ meant to be comprehensive; they are useful starting points. You should try to answer at least three major questions and at least one new question that your group comes up with on its own (i.e., one not on the list above).
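
As a starting point for the question of what reviewers talk about most, a simple word count is often enough to get a first impression. The sketch below is a minimal illustration on made-up review text, using only the Python standard library. The sample reviews, the tiny stop-word list, and the helper name `top_terms` are all hypothetical; in your project the text would come from the review table (check the schematic for the actual column name).

```python
import re
from collections import Counter

# Toy review texts standing in for review text pulled from the database.
reviews = [
    "Great coffee and friendly staff. The coffee is strong!",
    "The patio is great, but the service was slow.",
    "Slow service, average coffee.",
]

# A tiny, hand-picked stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "and", "is", "but", "was", "a", "an", "of", "to"}


def top_terms(texts, n=3):
    """Return the n most common non-stop-word tokens across all texts."""
    counts = Counter()
    for text in texts:
        # Lowercase, then keep runs of letters and apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


print(top_terms(reviews))
```

Raw counts like these favor common words, so they are best treated as a first look before moving on to the text-analysis techniques covered in the course.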

-----

## Objectives

### Exposition

1. Break the overarching question **_What are consumers' eating habits?_** into at least 3 smaller sub-questions. 
2. Explain how answering these contributes to answering the overarching question.

### Pull and Process Data

1. Use the Yelp database to construct datasets to be used in your analysis.
2. Create features, preprocess, normalize, cluster, reduce dimensions, etc. as necessary.
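
One way to construct an analysis-ready dataset is to aggregate inside SQL, so that only summary rows reach Python. The sketch below builds a tiny in-memory SQLite database as a stand-in for the course database and computes the review count and mean rating per city and category. The toy rows are invented, and the `name` and `stars` columns are assumptions; check the database schematic for the real column names.

```python
import sqlite3

# Build a tiny in-memory stand-in for the course database.
# `name` and `stars` are assumed column names, not confirmed ones.
con = sqlite3.connect(":memory:")
con.executescript("""
    create table business (id text primary key, name text, city text);
    create table category (business_id text, category text);
    create table review   (id text primary key, business_id text, stars integer);

    insert into business values
        ('b1', 'Cafe A', 'Champaign'),
        ('b2', 'Bar B',  'Champaign'),
        ('b3', 'Cafe C', 'Urbana');
    insert into category values
        ('b1', 'Coffee & Tea'), ('b2', 'Bars'), ('b3', 'Coffee & Tea');
    insert into review values
        ('r1', 'b1', 5), ('r2', 'b1', 3), ('r3', 'b2', 4), ('r4', 'b3', 2);
""")

# Aggregate inside SQL so only one row per (city, category) reaches Python.
query = """
    select b.city,
           c.category,
           count(r.id)  as n_reviews,
           avg(r.stars) as mean_stars
    from business as b
    join category as c on b.id = c.business_id
    join review   as r on b.id = r.business_id
    group by b.city, c.category
    order by b.city, c.category
"""
rows = con.execute(query).fetchall()
for row in rows:
    print(row)
con.close()
```

Note that a business listed under several categories contributes its reviews to each of them, so the per-category counts will not sum to the total number of reviews.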

### Analysis

For each question your group decides to answer, approach it in the following manner:

1. Question 1
   1. Use graphs, machine learning, data aggregation, or anything else needed to answer the question
2. Question 2
   1. Use graphs, machine learning, data aggregation, or anything else needed to answer the question
3. Question 3
   1. Use graphs, machine learning, data aggregation, or anything else needed to answer the question
4. Question 4
   1. Use graphs, machine learning, data aggregation, or anything else needed to answer the question
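
As one illustration of "data aggregation" in this sense, the sketch below counts how many users reviewed both businesses in each pair, which is one possible way to approach the question of whether users who visit one business tend to visit certain others. The `(user_id, business_id)` pairs are invented, the `user_id` column is an assumption about the review table, and the helper name `co_review_counts` is hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Toy (user_id, business_id) pairs, as might be pulled from the review table.
visits = [
    ("u1", "b1"), ("u1", "b2"),
    ("u2", "b1"), ("u2", "b2"), ("u2", "b3"),
    ("u3", "b2"), ("u3", "b3"),
]


def co_review_counts(pairs):
    """Count how many users reviewed both businesses in each pair."""
    by_user = defaultdict(set)
    for user, business in pairs:
        by_user[user].add(business)

    counts = Counter()
    for businesses in by_user.values():
        # sorted() gives each pair one canonical ordering, e.g. (b1, b2).
        counts.update(combinations(sorted(businesses), 2))
    return counts


counts = co_review_counts(visits)
print(counts.most_common())
```

The resulting pair counts could serve as edge weights if you go on to treat businesses as nodes in a network analysis.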


-----

### Conclusion

Summarize your results, in plain English, as if this is the only part your boss will read. Note that your boss is intelligent but has no interest in code or model output; she is only interested in words, pictures, and metrics.

-----

## Notes

- Completing the objectives will most likely __NOT__ proceed linearly. Querying data and subsequently analyzing these data will lead to new insights, which will mean extracting more data and performing new analyses. This entire process will also influence the types of questions you can ask and answer.
- The overall [database schematic][dbs] shows the data and their inherent relationships. This should be a starting point for your queries.
  - Note that we have not included the photo dataset; this keeps the data volume to a reasonable size.
- There is no unique solution to this project; each group should develop a different approach, analysis pipeline, and result.
- Your group should have fun with this! This is an open-ended project, similar to what you will face next semester in ACCY 575 and in the real world. The Yelp datasets contain a great deal of very interesting information, and this is your chance to demonstrate to the class your mastery of the subject material while having fun exploring.

-----

[dbs]: https://s3-media2.fl.yelpcdn.com/assets/srv0/engineering_pages/9c5f7a89fd08/assets/img/dataset/yelp_dataset_schema.png

In [1]:
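# Example query to pull reviews for all businesses in Champaign-Urbana
import sqlite3 as sql

import pandas as pd

# Single quotes mark SQL string literals; double quotes are meant for identifiers.
query = '''select b.*,
                  c.category,
                  r.*
           from business as b
           left join category as c
             on b.id = c.business_id
           left join review as r
             on b.id = r.business_id
           where b.city in ('Champaign', 'Urbana')
           '''

# The `with` block manages the transaction but does not close the connection,
# so close it explicitly once the data has been loaded.
with sql.connect('/home/data_scientist/accy571/readonly/data/yelp.db') as con:
    data = pd.read_sql(query, con)
con.close()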
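
If the business and review tables share any column names, selecting `b.*` and `r.*` together returns those names twice, which can make the resulting table awkward to work with. The sketch below shows one possible alternative on a toy in-memory database: explicit column aliases plus `?` placeholders for the city list. The toy rows, the `name` and `stars` columns, and the alias names are all assumptions for illustration; check the schematic for the actual columns.

```python
import sqlite3

# Tiny in-memory stand-in; `name` and `stars` are assumed column names.
con = sqlite3.connect(":memory:")
con.executescript("""
    create table business (id text, name text, city text, stars real);
    create table review (id text, business_id text, stars integer);
    insert into business values ('b1', 'Cafe A', 'Champaign', 4.0),
                                ('b2', 'Diner B', 'Peoria', 3.5);
    insert into review values ('r1', 'b1', 5), ('r2', 'b2', 3);
""")

cities = ["Champaign", "Urbana"]
placeholders = ", ".join("?" for _ in cities)  # one "?" per city

# Alias every selected column so each name in the result is unique.
query = f"""
    select b.id    as business_id,
           b.name  as business_name,
           b.stars as business_stars,
           r.id    as review_id,
           r.stars as review_stars
    from business as b
    left join review as r on b.id = r.business_id
    where b.city in ({placeholders})
"""
cur = con.execute(query, cities)
columns = [d[0] for d in cur.description]
rows = cur.fetchall()
print(columns)
print(rows)
con.close()
```

The same query and city list can be passed to `pd.read_sql` through its `params` argument if you prefer to load the result straight into a DataFrame.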