# Will a Chicago restaurant pass or fail its food safety inspection? 

### Data sources:

- [Food Inspections Dataset](https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5/data)

- [Median Household Income in Chicago, IL by Zip Code](http://zipatlas.com/us/il/chicago/zip-code-comparison/median-household-income.htm)



### Features for food inspection outcomes:
- *Risk* level 1 - 3: the facility's risk category (1 is highest risk), which determines how often it is inspected
- *Type of inspection* performed 
- *Inspection date*
- Median household *income* by zip
- The restaurant's *zip code*
- Zip code's average *population*


### Dealing with Data:

- *Combine both datasets* based on **Zip** Column

- [Food Inspections column explanations](https://data.cityofchicago.org/api/assets/BAD5301B-681A-4202-9D25-51B2CAE672FF)
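
A minimal sketch of the zip-based join described above, assuming pandas. The sample rows, the income values, and the `median_income` column name are made up for illustration; the real files may use different column names. Zip codes are normalized to 5-character strings first, since one file may parse them as floats and the other as text.

```python
import pandas as pd

# Hypothetical sample rows standing in for the two downloaded files.
inspections = pd.DataFrame({
    "Inspection ID": [1, 2, 3],
    "Zip": [60614.0, 60622.0, 60999.0],  # zips are often read in as floats
    "Results": ["Pass", "Fail", "Pass"],
})
income = pd.DataFrame({
    "Zip": ["60614", "60622"],
    "median_income": [50000, 60000],  # made-up values
})


def normalize_zip(series):
    """Convert zip codes to 5-character strings so both tables share a key type."""
    as_int = pd.to_numeric(series, errors="coerce").astype("Int64")
    return as_int.astype("string").str.zfill(5)


inspections["Zip"] = normalize_zip(inspections["Zip"])
income["Zip"] = normalize_zip(income["Zip"])

# A left join keeps every inspection; indicator=True adds a _merge column
# that flags rows with no matching income record.
merged = inspections.merge(income, on="Zip", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
print("Inspections without income data:", len(unmatched))
```

Checking `unmatched` before modeling shows how many inspections would end up with a missing income feature.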

### Due Dates and Workflow:


- [X] Scope - Due Tuesday, 1/28/20
- [X] Scope Lock - Due Wednesday, 1/29/20
- [X] Download data locally - Due Friday, 1/31/20
- [X] Data in postgres - Due Tuesday, 2/4/20: 
    - AWS SETUP: [Setup AWS_EC2](https://github.com/thisismetis/chi20_ds13/blob/master/curriculum/project-03/aws-setup/00_setup_aws_ec2.md)
    - Use `scp` to copy files to the instance:
        - From a local terminal, in a folder that can access your files, the general form is:
            - `scp -i ~/path/to/mykey.pem [-r] file_or_folder_to_copy ubuntu@[AWS_instance_public_ip]:/folder/filename.csv`
            - Add `-r` when copying a folder. The command actually run:
                - `scp -i ~/.ssh/aws_key.pem -r Restaurant_Data ubuntu@3.18.213.250:~/Restaurant_Data/`
    - Install postgres on AWS: [Pandas & Postgres Setup](https://github.com/thisismetis/chi20_ds13/blob/master/curriculum/project-03/aws-setup/02_HW_setup_aws_and_psycopg.md)
    - Create DB 'restaurants' for Project 3:
        - Connect to AWS instance:
            - `ssh -i ~/.ssh/aws_key.pem ubuntu@[AWS_instance_public_ip]`
            - If successful, you'll see:
                - `Welcome to Ubuntu 18.04.3 LTS (GNU/Linux 4.15.0-1051-aws x86_64)`
        - In the terminal, type `psql`:
            - You should then see this:
                - `psql (10.10 (Ubuntu 10.10-0ubuntu0.18.04.1))`
                - `Type "help" for help.`
                - `ubuntu=#`
        - Create database:
            - `ubuntu=# create database restaurants;`
            - Without the trailing semicolon, psql waits for more input and does not run the statement
        - Connect to restaurants db:
            - `ubuntu=# \connect restaurants`
    - Create Table Schemas manually
    - Import Data using psql COPY
        - Examples of this in setup.sql file in the [SQL setup lecture](http://localhost:8888/notebooks/curriculum/project-03/sql-setup/01_SQL_Intro_and_Setup.ipynb)
- [X] MVP - Due Thursday, 2/6/20
- [X] Done modeling - Due Friday, 2/7/20
- [X] Visualizations Done - Due Monday, 2/10/20
- [X] Pre-presentation day - Tuesday, 2/11/20
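
The "Import Data" step above uses psql's COPY. As an alternative, a similar load can be sketched from Python, assuming a psycopg2 connection and a table whose schema was already created by hand. The helper names, the `inspections` table, and the column names are hypothetical; `copy_expert` is the psycopg2 cursor method that streams a file into a `COPY ... FROM STDIN` statement.

```python
def build_copy_sql(table, columns):
    """Return a COPY ... FROM STDIN statement for a CSV file with a header row."""
    column_list = ", ".join(columns)
    return f"COPY {table} ({column_list}) FROM STDIN WITH (FORMAT csv, HEADER true)"


def load_csv(conn, csv_path, table, columns):
    """Stream a local CSV into an existing table through an open psycopg2 connection."""
    with open(csv_path, newline="") as f, conn.cursor() as cur:
        cur.copy_expert(build_copy_sql(table, columns), f)
    conn.commit()


# Example usage (assumes psycopg2 is installed and the table already exists):
#   import psycopg2
#   conn = psycopg2.connect(dbname="restaurants", user="ubuntu")
#   load_csv(conn, "Restaurant_Data/food_inspections.csv",
#            "inspections", ["inspection_id", "zip", "results"])
print(build_copy_sql("inspections", ["inspection_id", "zip", "results"]))
```

Because the file is streamed from the client, this works even when the CSV is not readable by the postgres server process.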

### Workflow, after getting data:

1. Re-ask your question
2. Perform EDA
3. Start with a dummy classifier -> assigns everything to the majority class:
    - How will I know if I'm overfitting or underfitting:
        - train/test split
        - train/validation/test split
        - cross validation/test
4. Metrics - pick which ones you want:
    - Area under the ROC curve (ROC AUC)
        - Start with this first to see what model is best
        - ROC AUC summarizes the confusion matrix (CM) across all thresholds, which makes it good for comparing models
    - Accuracy
        - The share of data points you get right; if your dataset is imbalanced, don't use accuracy
    - Recall
        - Use when you want to catch as many actual positives as possible
            - E.g. making sure you catch everyone with cancer
    - Precision
        - Use when your positive predictions need to be right
            - US government looking for terrorists
            - Don't want to hurt someone who isn't a criminal
    - F1
        - Precision and recall both matter, so F1 balances them (it is their harmonic mean)
    - Log loss
        - Good if you need the predicted probabilities themselves to be accurate, not just the class labels
    - Confusion matrix (cm)
        - If you're deciding between accuracy, recall, precision, & F1:
            - Look at the CM and see which ones are getting misclassified and tune the threshold
                - Tune the balance between recall and precision
5. Modeling:
    - Dummy classifier to predict majority class
    - CV - train score and validation score
    - If train is .9 and val is .5, this shows the model is not generalizing well and you're overfitting
        - If this is the case, you can try other models with default parameters:
            - KNN: scale
                - If you're overfitting, make K higher
            - Logistic
                - Higher C => less regularization; if you want more simplicity then decrease C
            - SVM: scale
                - Kernel: 
                    - Linear: tune C; a higher C means less regularization and more risk of overfitting, so lower it if you overfit
                    - RBF: tune C and gamma
            - Naive Bayes: don't need to scale
                - Gaussian
                - Bernoulli
                - Multinomial
            - CART models: don't need to scale
                - Decision Trees
                    - Max depth: if your trees go too deep then you're overfitting
                    - Min samples per leaf
                - Random Forest
                    - N features
                    - Number of trees
                - XGBoost
6. Extra:
    - Grid Search:
        - Runs the for loop over hyperparameter combinations for you
            - Tune hyperparameters of model
        - Randomized search (samples combinations instead of trying every one)
    - Scale:
        - Distances involved => scale your data
    - Threshold
    - Class imbalance 
        - Under-sample or over-sample; random or SMOTE
    - Ensembling
    - Feature engineering
7. Interpret:
    - Make a prediction
    - Look at the beta values and interpret them if you're doing logistic regression
    - Look at feature importance -> with any of the CART models
    - Look at outliers in your prediction, not in your data
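
Steps 3 to 5 and the logistic regression part of step 7 can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic, imbalanced data from `make_classification`, not the inspection data itself; the choice of `C=1.0`, the 0.3 threshold, and the 5 folds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features: roughly 80% majority class.
X, y = make_classification(n_samples=1000, n_features=6, n_informative=4,
                           n_clusters_per_class=1, weights=[0.8, 0.2],
                           random_state=0)

# Hold out a test set; cross-validation on the rest supplies validation scores.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    # Baseline: always predicts the majority class.
    "dummy": DummyClassifier(strategy="most_frequent"),
    # Scaling lives inside the pipeline so it is fit on each training fold only.
    "logistic": make_pipeline(StandardScaler(),
                              LogisticRegression(C=1.0, max_iter=1000)),
}

results = {}
for name, model in models.items():
    cv = cross_validate(model, X_train, y_train, cv=5, scoring="roc_auc",
                        return_train_score=True)
    # A large gap between train and validation scores suggests overfitting.
    results[name] = (cv["train_score"].mean(), cv["test_score"].mean())
    print(f"{name}: train AUC={results[name][0]:.3f}, "
          f"validation AUC={results[name][1]:.3f}")

# Refit on the full training set and apply a custom decision threshold.
best = models["logistic"].fit(X_train, y_train)
threshold = 0.3  # below 0.5 trades precision for recall; illustrative value
proba = best.predict_proba(X_test)[:, 1]
y_pred = (proba >= threshold).astype(int)

cm = confusion_matrix(y_test, y_pred)
print(cm)
print("recall:", recall_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))

# Coefficients (beta values) of the logistic model, on standardized features.
coefs = best.named_steps["logisticregression"].coef_[0]
print("coefficients:", coefs)
```

Changing `threshold` and re-reading the confusion matrix is one way to tune the balance between recall and precision described in step 4.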