## Final project


You should work in a group of 3-5 people (no more than 5). The
purpose of the project is to give you hands-on experience with machine learning and real data analysis.


There are two directions you can choose:

- Big data analysis
- ML algorithm implementation

#### Big data analysis 

- Posing questions
- Finding data sources
- Exploring your data
- Statistical machine learning analysis
- Comparing methods and visualizing the results
- Summarizing your findings


#### ML algorithm implementation 

- Identify an algorithm to be implemented 
- Understand the model
- Implement the algorithm in Python without using black-box packages
- Test your algorithm on simulated data
- Run your algorithm on a real dataset
- Summarize and visualize the results
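
As a rough illustration of the "implement without black-box packages, then test on simulated data" steps above, here is a minimal sketch using only the Python standard library. Least-squares regression fitted by batch gradient descent is only a stand-in for whatever algorithm your team picks. The function names, the true parameter values, the learning rate, and the number of iterations are all hypothetical choices made for this example.

```python
import random


def simulate_linear_data(n, true_weights, true_bias, noise_sd, seed=0):
    """Draw n points from y = x . w + b + Gaussian noise (hypothetical setup)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in true_weights]
        mean = sum(w * xj for w, xj in zip(true_weights, x)) + true_bias
        X.append(x)
        y.append(mean + rng.gauss(0.0, noise_sd))
    return X, y


def fit_linear_regression(X, y, learning_rate=0.1, n_iter=500):
    """Least-squares fit by batch gradient descent, written without packages."""
    n, p = len(X), len(X[0])
    weights, bias = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for x, target in zip(X, y):
            # Residual of the current fit on this observation.
            residual = sum(w * xj for w, xj in zip(weights, x)) + bias - target
            for j in range(p):
                grad_w[j] += residual * x[j] / n
            grad_b += residual / n
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


# Because the data are simulated, the true parameters are known and the
# estimates can be compared against them directly.
true_weights, true_bias = [2.0, -1.0], 0.5
X, y = simulate_linear_data(200, true_weights, true_bias, noise_sd=0.1)
est_weights, est_bias = fit_linear_regression(X, y)
print("estimated weights:", est_weights, "estimated bias:", est_bias)
```

The point of the simulated run is that the truth is known: if the estimates do not land near the values used to generate the data, the bug is in the code rather than in the data.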


Please work collaboratively with a team. You will get a chance to present your findings through writing and speaking.


### Getting Started

I recommend you get started by brainstorming project topics or themes that interest your group. To narrow down your project to just one topic, think about:

_For the big data analysis direction_, ask:

- What questions does your topic address, or what problems does your topic solve? Why is this question meaningful to your team (or simply to you)? How does this project help your future career?
- What are the challenging parts of your topic? Why are they challenging?
- Are there credible, public datasets available to explore the topic? (See below for some suggested data sources.)
- Is a 5 $\sim$ 6-week project long enough to explore the topic reasonably well?

_For the ML algorithm implementation direction_, ask:

- Why are you interested in this algorithm? Why is this algorithm meaningful to your team (or simply to you)? How does this project help your future career?
- What are the challenging parts of your topic? Why are they challenging?
- Is open-source code already available? If so, how will your implementation differ from it?
- Are you familiar with conducting simulation studies? 
- Can you find publicly available data to test your algorithm?
- Is a 5 $\sim$ 6-week project long enough to explore the topic reasonably well?


Make sure that everyone in the group agrees on the topic. You will need a certain curiosity about your topic in order to stay motivated throughout the quarter.

Once you've selected a project topic, you can start working on the project proposal and the project itself.


The final project has four components:
          
                                -----------------------------------------
                                         Proposal | Due May 19 11:59pm
                                     Presentation | Week 10, on Zoom (online)
                                          Report  | Due June 13 11:59pm 
                                 Group Evaluation | Due June 14 11:59pm 
                                ------------------------------------------

### Proposal

Your group should submit a 1-2 page project proposal (1 per group) by May 19 11:59pm. Your proposal should include answers to the questions in the _Getting Started_ section. Your proposal should also include:

- Your group members' names
- Which direction your team will pursue (big data analysis or ML algorithm implementation). You can do both; if you succeed, you will receive bonus points.
- What's the topic of your project? 
- What data source(s) will your team use? Briefly describe each data source and provide a link to it. This is a check to make sure that data are actually available for your topic. If you ultimately decide not to use some of the data sources, or find additional data sources later, that's okay.
- What statistical methods will your team use?
- What makes your project challenging? Consider that you will have 5 weeks to work on the project. Do not pick a project that is too hard!

The proposal is your best opportunity to get feedback on your project. Make sure it's clear and addresses the questions above. You can also use the proposal to tell me about any other comments or concerns you have about your project topic. You do not need to present any data analyses in the proposal.

The proposal will be graded satisfactory/unsatisfactory (5 points). Your priority should be working on the project itself; __don't__ spend more than a few hours working on the proposal. Make sure all group members have read the proposal and agree on what it says. 

Submit your proposal on Canvas. Each group only needs to submit one file. 
The proposal should be a _.pdf_ or _.html_ file (we __do not__ accept _.doc_ files).

### Presentation

In week 10, each group will present preliminary results from their project to the class. Each group can either elect a leader to present the results or have every team member present. You are required to attend and provide feedback on presentations from some of the other teams. More details about presentations will be released in week 8.

### Grading criteria

The final report is due in finals week. There is no page limit for the final report ($\sim$10 to 15 pages is a reasonable range), but the report is graded on quality, not quantity!

The final report should be a _.pdf_ file, formatted like a statistical paper.

The report should at least contain the following components:

* Introduction (2 points): describe your problem, and introduce the dataset and any mathematical notation used in the report.


* Proposed method (3 points): formally state the model, the method, and the algorithm you are using, with the necessary details. Define all notation introduced in this section.


* Real data study (4 points): make plots and tables to summarize your findings on a real dataset; check model assumptions. An important aspect we look for is that you justify the use of the method. For example, if there is a clear linear relationship, it is unnecessary to use a nonparametric method. 

* For ML algorithm implementation projects:
    - Simulation study (4 points): run simulations to check whether your code and/or method works; this is also the place to test how the method behaves when the model assumption(s) do not hold.
  
  For big data analysis projects:
    - Visualization (4 points): explain your conclusions using plots. Plots should have clear titles, axes, and labels. Plots should support your conclusions and display nicely in the final report. 


* Conclusion or summary (2 points)


* Reference and acknowledgement (1 point)


* Appendix (2 points). This is the place for technical material and code. You can put mathematical proofs (if you have any) in this section, and you should put your code here. Please follow the same principles as in your homework. Being clear is more important than being technical: make sure that even friends who are not taking the class can understand your writing! Piling up jargon and equations without explaining them is bad!

* Amount of effort (2 points)
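
One way to back up the "justify the use of the method" criterion in the real data study item above is to compare held-out prediction error between a simple and a more flexible method. The sketch below is an illustration only, not a required procedure: it simulates data with a truly linear relationship and compares a straight-line fit with a 3-nearest-neighbor average. The function names, sample sizes, noise level, and the choice `k=3` are hypothetical choices for this example.

```python
import random


def fit_line(xs, ys):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def knn_predict(xs, ys, x_new, k=3):
    """Average the responses of the k training points closest to x_new."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k


def mean_squared_error(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def draw_sample(rng, n):
    """Simulated data with a linear mean, y = 1 + 2x + noise (assumed setup)."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


rng = random.Random(1)
x_train, y_train = draw_sample(rng, 200)
x_test, y_test = draw_sample(rng, 2000)  # fresh draws, never used for fitting

intercept, slope = fit_line(x_train, y_train)
mse_linear = mean_squared_error([intercept + slope * x for x in x_test], y_test)
mse_knn = mean_squared_error(
    [knn_predict(x_train, y_train, x) for x in x_test], y_test
)
print(f"held-out MSE  linear: {mse_linear:.3f}  3-NN: {mse_knn:.3f}")
```

With a real dataset you would replace `draw_sample` with a train/test split of your own data. If the flexible method does not clearly beat the simple one on held-out data, that supports reporting the simpler method.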

Grading scales:

                           ------------------------------- 
                                      Grade | Points
                           -------------------------------    
                                       Good | full credit
                               Satisfactory | half credit
                            Poor or no work | 0
                           -------------------------------

### Data sources

- Yale face dataset: http://vision.ucsd.edu/content/yale-face-database
- UCI machine learning data repository: https://archive.ics.uci.edu/ml/index.php
- The MNIST handwriting database: http://yann.lecun.com/exdb/mnist/
- The Reuters Dataset: https://martin-thoma.com/nlp-reuters/
- Kaggle competitions: https://www.kaggle.com/
- Yahoo datasets: https://webscope.sandbox.yahoo.com/
- Yahoo finance dataset: https://finance.yahoo.com/quotes/OCR,dataset/view/v1/
- Implement one of the algorithms described in this course, and try to scale it to large datasets.


### ML algorithms
- Latent Dirichlet allocation (LDA) model: https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf (also, see [Wikipedia](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation))
- Generative adversarial network (GAN): https://proceedings.neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf (also, see [Wikipedia](https://en.wikipedia.org/wiki/Generative_adversarial_network))
- Markov Chain Sampling Methods for Dirichlet Process Mixture Models: https://www.jstor.org/stable/1390653?seq=1
- Gaussian processes
- Conformal prediction
- Bayesian additive regression trees (BART): https://arxiv.org/abs/0806.3286
- Sparse Bayesian factor analysis: https://www.tandfonline.com/doi/abs/10.1080/01621459.2015.1100620
- Topics in ELS, e.g.,
    - Group lasso 
    - Additive trees/generalized trees
    - Neural networks
    - Sparse PCA and sparse factor analysis
    - Other topics include manifold learning, variational inference, etc.
    