# Appendix B - Building and Deploying the App

* [Part 1 - The Webservice](#webservice)
* [Part 2 - The Shiny Web Application](#shiny)
* [Part 3 - SquareSpace](#squarespace)
* [References](#references)

<a id="webservice"></a>
## Part 1 - The Webservice


The code for the webservice can be found in the `./webservice` subdirectory. This is a copy of the code running in production. The origin git repo is *not* in this repository; it is hosted on [Heroku](https://www.heroku.com/).

### How the Webservice Works

The webservice is a minimal [Flask](http://flask.pocoo.org/) application that provides prediction probabilities for a given candidate. There is one useful entry point, `/predict`, which expects the applicant's `admissionstest`, `AP`, etc. on the query string. It returns a JSON document listing the predicted probability of getting into each college.

Sample input:

```
http://boiling-forest-8250.herokuapp.com/predict?admissionstest=0.926899206&AP=7&averageAP=1.06733864&SATsubject=0.324271565&GPA=-0.187109979&schooltype=0&intendedgradyear=2017&female=1&MinorityRace=0&international=0&sports=0&earlyAppl=0&alumni=0&outofstate=0&acceptrate=0.151&size=6621&public=0&finAidPct=0&instatePct=0
```

Sample output:
```
{
  "preds": [
    {
      "college": "Princeton",
      "prob": 0.26166666666666666
    },
    {
      "college": "Harvard",
      "prob": 0.23999999999999999
    },
    {
      "college": "Yale",
      "prob": 0.23999999999999999
    },
    ...
 ]
}
```
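
The request and response shapes above can also be exercised from Python. The following is a minimal sketch, not part of the project code: the helper names `build_predict_url` and `parse_predictions` are hypothetical, only a few of the query parameters are shown, and the response body is a hand-written stand-in shaped like the sample output rather than a real reply from the server.

```python
import json
from urllib.parse import urlencode

# Base URL taken from the sample input above.
BASE_URL = "http://boiling-forest-8250.herokuapp.com/predict"


def build_predict_url(params, base_url=BASE_URL):
    """Return the /predict URL with `params` encoded on the query string."""
    return base_url + "?" + urlencode(params)


def parse_predictions(body):
    """Turn a JSON response body into (college, prob) pairs, highest first."""
    preds = json.loads(body)["preds"]
    return sorted(((p["college"], p["prob"]) for p in preds),
                  key=lambda pair: pair[1], reverse=True)


# A few of the normalized inputs from the sample request (not the full set).
example_params = {"admissionstest": 0.926899206, "AP": 7, "female": 1}
example_url = build_predict_url(example_params)

# Hand-written stand-in for a response, shaped like the sample output.
example_body = ('{"preds": [{"college": "Harvard", "prob": 0.24}, '
                '{"college": "Princeton", "prob": 0.26}]}')
example_ranked = parse_predictions(example_body)
```

To call the live service, the URL could be passed to `urllib.request.urlopen` and the decoded body handed to `parse_predictions`.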

### Webservice startup

Upon receiving the first `/predict` request, the webservice performs the same logic as the classification IPython notebook: it loads the normalized college data, imputes missing values, and fits Scikit-Learn's Random Forest classifier. The resulting classifier is kept in memory as a Python global variable to service subsequent prediction requests. There is currently no locking around this initialization.
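
If two requests arrive before the classifier exists, both could start training it. One way this might be guarded is a lock around the first-use initialization. The sketch below is an illustration under assumptions, not the production code: `get_classifier`, `_clf`, `_clf_lock` and the `train` callable are hypothetical names, and `train` stands in for the expensive load-and-fit step.

```python
import threading

_clf = None                      # cached classifier, shared by all requests
_clf_lock = threading.Lock()     # guards the one-time training step


def get_classifier(train):
    """Return the cached classifier, calling `train()` once on first use.

    `train` is a stand-in for the expensive load-and-fit step.
    """
    global _clf
    if _clf is None:                 # fast path once the model exists
        with _clf_lock:
            if _clf is None:         # re-check: another thread may have won
                _clf = train()
    return _clf
```

This only protects threads within a single process; if the server runs several worker processes, each one would still train its own copy.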

### Webservice Dependencies - OpenShift

We started out on OpenShift. The free account worked fine at first, then developed consistent and severe performance problems a few days before the project was due. With two days remaining, we scrambled and moved from OpenShift to Heroku. This section serves as a reference (and as a warning not to use OpenShift again).

Since the webservice runs the full Pandas and Scikit-Learn stacks, these had to be installed on the OpenShift cartridge. Here's what was done:

1. Create an OpenShift account
1. Install the [client tools](https://developers.openshift.com/en/managing-client-tools.html). This will install `rhc`, the necessary local command line tool for managing OpenShift apps.
1. Use the Flask Quickstart template ([details](https://developers.openshift.com/en/python-flask.html))

    ```
    rhc app create myflaskapp python-2.7 --from-code=https://github.com/openshift-quickstart/flask-base.git
    ```

1. This will create a local myflaskapp git repository. Go into this repository: `cd myflaskapp`
1. SSH into the app and install the dependent packages:

    ```
    rhc ssh myflaskapp
    source ~/python/virtenv/activate
    pip install numpy
    ```

    The `pip install` has to be repeated for `scipy`, `pandas` and `scikit-learn`. This takes a while because each package is compiled locally on the OpenShift instance, and the resulting builds may not have optimal performance.

1. After all the packages have been installed, take the output of `pip freeze` and update the `requirements.txt` in the *local* repository.
1. At this point, you can grab the appropriate files from the `./webservice` directory, notably: `TIdatabase.py`, `collegelist.csv`, `collegedata_normalized.csv` and `flaskapp.py`.



#### DevOps Notes

To see the logs, use `rhc tail -o '-n 100' myflaskapp`.

Common rhc commands can be found [here](https://developers.openshift.com/en/managing-common-rhc-commands.html)

### Webservice Dependencies - Heroku

Heroku was easier to configure since there are buildpacks available that contain the entire Conda stack with all the Scipy, Numpy and Scikit-learn dependencies. There were still numerous gotchas, mainly related to finding a combination of scipy, numpy and scikit-learn versions that would all play nicely together.

Heroku has a [nice walkthrough](https://devcenter.heroku.com/articles/getting-started-with-python#introduction) about setting up a Python app in minutes. I mostly followed that, with the following changes:

Add the Conda buildpack:
```
heroku config:add BUILDPACK_URL=https://github.com/kennethreitz/conda-buildpack.git
```

The scipy in this buildpack was broken, so scipy was obtained from a separate buildpack:
```
heroku buildpacks:set https://github.com/thenovices/heroku-buildpack-scipy
```

This version of scipy does not work with the latest scikit-learn, so I had to downgrade scikit-learn. Here is our final `requirements.txt` file:
```
gunicorn==19.3.0
psycopg2==2.6
SQLAlchemy==1.0.4
whitenoise==1.0.6
Flask==0.10.1
pandas==0.17.1
numpy==1.9.1
scipy==0.15.1
scikit-learn==0.16.1
nose==1.3.7
```

and our Procfile:
```
web: gunicorn flaskapp:app --log-file=-
```
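
Since most of the trouble came from mismatched package versions, it can help to compare the pins in `requirements.txt` with what is actually installed. The sketch below is one possible way to do that and was not part of the project: `parse_pins` and `find_mismatches` are hypothetical helpers, and the use of `importlib.metadata` assumes Python 3.8 or later, whereas the deployed app ran on Python 2.7.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Parse `name==version` lines into a {name: version} dict."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not an exact pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins, installed_version=metadata.version):
    """Return {name: (pinned, installed)} for packages that differ or are missing."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = installed_version(name)
        except metadata.PackageNotFoundError:
            installed = None         # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

Calling `find_mismatches(parse_pins(open("requirements.txt").read()))` in the target environment would list any package whose installed version differs from its pin.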

After the pain and suffering with OpenShift's free account, we went with the Heroku \$7/mo hobbyist dyno with the hope that the app would not go down again.

### Programming Notes

The real work is done in `flaskapp.py`.

Logging is off by default. To log errors from your app, use:

```
import logging

logging.basicConfig(level=logging.DEBUG,format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
```

We simply used global variables to store information. It is also possible to use the [appcontext](http://flask.pocoo.org/docs/0.10/appcontext/).

Note that the app can be tested locally very easily. From a local shell use: `python flaskapp.py`. It will say which address / port it is listening on when starting up.

### Consuming the Webservice from R

Sample code to consume the webservice can be found in `rclient.R`. This simulates how the production Shiny app invokes the webservice. An R data.frame is created with the normalized user-supplied values and used to populate the query string. Note that the webservice ignores the last five variables, which are specific to a given college, since probabilities for *all* colleges are returned.

The returned JSON is easily parsed into an R data.frame for presentation to the user or further manipulation. Here is a snippet:

```
library(jsonlite)

# create query string
qs = paste0(colnames(pred),"=",pred[1,],collapse="&")
# server = "http://127.0.0.1:5000/predict"   # use this line instead to test locally
server = "http://boiling-forest-8250.herokuapp.com/predict"

URL = paste0(server,"?",qs)

# fetch and parse the JSON response; js$preds is a data.frame
js  = fromJSON(URL)
df = js$preds
df$college = as.factor(df$college)
summary(df)
```




## The Webservice Code

(This is not in a code cell because it is not meant to be executed)



```
from flask import Flask
from flask import jsonify, request

import os
import pandas as pd
from sklearn.preprocessing import Imputer
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import logging

import TIdatabase as ti

app = Flask(__name__)

clf = None
logging.basicConfig(level=logging.DEBUG,format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
 

ws_cols = ["admissionstest","AP","averageAP","SATsubject","GPA","schooltype",
                  "female","MinorityRace","international","sports",
                  "earlyAppl","alumni","outofstate"]
college_cols = ["acceptrate","size","public"]
predictor_cols = ws_cols + college_cols

cols_to_drop = ['classrank', 'canAfford', 'firstinfamily', 'artist', 'workexp', 'visited', 'acceptProb',
                'addInfo','intendedgradyear']
NUM_ESTIMATORS = 1000

colleges = ti.College()

def load_classifier():
    global clf
    df = pd.read_csv(os.path.join(os.path.dirname(__file__),"collegedata_normalized.csv"), index_col=0)
    dfr = df.drop(cols_to_drop,axis=1)
    dfr = dfr[pd.notnull(df["acceptStatus"])]
    dfpredict = dfr[predictor_cols]
    dfresponse = dfr["acceptStatus"]
    imp = Imputer(missing_values="NaN", strategy="median", axis=1)
    imp.fit(dfpredict)
    X = imp.transform(dfpredict)
    y = dfresponse
    clf = RandomForestClassifier(n_estimators=NUM_ESTIMATORS, criterion="gini")
    clf.fit(X,y)
    return clf

def genPredictionList(vals):
    """
    vals (coming from the request arguments) is a list of tuples [('name1','val1'),('name2','val2')...]
    """
    global ws_cols
    global clf
    global colleges
    X = pd.Series(dict((name, float(val)) for name, val in vals))
    if clf is None: load_classifier()
    preds = []
    for i, row in colleges.df.iterrows():
        X[college_cols] = row[college_cols]
        y = clf.predict_proba(X[predictor_cols])[0][1]
        p = {'college':row.collegeID, 'prob':y}
        preds.append(p)
    return preds
    #e.g.  [{'college':'harvard', 'prob':y}, {'college':'yale', 'prob':0.25}, {'college':'brown', 'prob':0.89}]

@app.route('/')
def hello_world():
    return "Welcome to the Team Ivy Web Service"

@app.route("/predict")
def predict():
    preds = genPredictionList(request.args.iteritems())
    return jsonify(preds = preds)


if __name__ == '__main__':
    app.run(debug=True)

```

<a id="shiny"></a>
## Part 2 - The Shiny Web Application

[Shiny](http://shiny.rstudio.com/) is a web application framework for R. It allows rapid development of reactive web applications. In this project, Shiny is used to implement all user interaction including plots and charts.

The Shiny app is hosted at http://www.shinyapps.io/

We initially attempted to implement RandomForests in R, but the results were not consistent with the Python classification code. It was at that time that we decided to implement the Python webservice. 

The Shiny app consists of three files:



`global.R` is common code for both the client and server. When doing the classification, we saved the normalization
means and standard deviations so we could normalize user input precisely the same way. We load in those normalization values upon app startup.

```
## shape data for MST
library(dplyr)
load('model.RData')
# ACT-to-SAT conversion table; c() rather than list() so that sat and act are ordinary columns
act2sat<-data.frame(sat=c(1600, 1560, 1510, 1460, 1420, 1380, 1340, 1300, 1260, 1220, 1190, 1150, 1110, 1070, 1030, 990, 950, 910, 870,
                          830, 790, 740, 690, 640, 590, 530),
                    act=c(36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19,
                          18, 17, 16, 15, 14, 13, 12, 11))

datanormed<-read.csv("collegedata_normalized.csv")
datanormed$SATsubject<-as.numeric(datanormed$SATsubject)

drops <- c("X","studentID","classrank","GPA_w","program","intendedgradyear","addInfo","canAfford",
           "MinorityGender","firstinfamily","artist","workexp","collegeID","visited")
dataunnormed$female[dataunnormed$female<1]<-0
dataunnormed$MinorityRace[dataunnormed$MinorityRace<1]<-0
dataunnormed$schooltype[dataunnormed$schooltype<1]<-0
dataunnormed$earlyAppl[dataunnormed$earlyAppl<1]<-0
dataunnormed$outofstate[dataunnormed$outofstate<1]<-0
dataunnormed$alumni[dataunnormed$alumni<1]<-0
dataunnormed$international[dataunnormed$international<1]<-0
dataunnormed$sports[dataunnormed$sports<1]<-0
dataunnormed<-plyr::rename(dataunnormed,c("admissionstest"= "ACT/SAT score",
                                          "GPA"= "GPA",
                                          "averageAP"= "average AP score",
                                          "AP"= "# AP exams taken",
                                          "SATsubject"= "# SAT subject tests taken",
                                          "female"= "female indicator",
                                          "schooltype"= "private high-school indicator",
                                          "MinorityRace"= "minority indicator",
                                          "earlyAppl"= "early application",
                                          "outofstate"= "out of state indicator",
                                          "alumni"= "alumni indicator",
                                          "international"= "international indicator",
                                          "sports"= "varsity sports indicator"))
subdata<-dataunnormed[,1:27]
subdata<-subdata[,!(names(subdata) %in% drops)]


normedmeans<-read.csv("normalize_means.csv")
normedstds<-read.csv("normalize_stds.csv")

better_labels<-c("varsity sports indicator",
                 "international indicator",
                 "alumni indicator",
                 "public university indicator",
                 "out of state indicator",
                 "early application",
                 "minority indicator",
                 "private high-school indicator",
                 "female indicator",
                 "# SAT subject tests taken",
                 "# AP exams taken",
                 "college size",
                 "average AP score",
                 "GPA",
                 "ACT/SAT score")

```
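
The point of saving the means and standard deviations is that a user's raw input is z-scored with exactly the statistics used for the training data. As a minimal sketch of that arithmetic (in Python rather than R, with made-up statistics; the real values live in `normalize_means.csv` and `normalize_stds.csv`):

```python
def normalize(value, mean, std):
    """Z-score a raw input using the mean and std saved at training time."""
    return (value - mean) / std


# Hypothetical saved statistics, for illustration only.
saved_means = {"GPA": 3.5}
saved_stds = {"GPA": 0.5}

# A raw GPA of 3.0 sits one standard deviation below this hypothetical mean.
gpa_normed = normalize(3.0, saved_means["GPA"], saved_stds["GPA"])
```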


`ui.R` contains the user interface:
```

library(dplyr)
library(shiny)
library(shinyBS)
require(ggplot2)
require(reshape2)
require(plyr)
library(randomForest)
library(ggvis)




shinyUI(fluidPage(
  titlePanel("Chance me."),
  fluidRow(column(12,align="center",

           wellPanel(
              sliderInput("sat", "SAT composite score",
                         0, 2400, 1200, step = 10),
              sliderInput("act", "ACT composite score",
                          0, 36, 18, step = 1),
              sliderInput("gpa", "GPA (unweighted)", 0,4, 2, step = .1),
              sliderInput("apnum", "Number of AP exams taken", 0,10, 5, step = 1),
              sliderInput("apave", "Average AP score",
                         0, 5, 2.5, step = .1),
              sliderInput("sat2ave", "Number of SAT Subject tests taken",
                          0, 10, 5, step = 1)

  )
,column(4,selectInput("hs","What type of high school did you attend?",
                        c("Public"="0","Private"="1","Parochial"="2","Homeschool"="3")),
           radioButtons("gender","What gender do you identify as?",
                        c("Female"="1","Male"="-1","Other"="0")),
           selectInput("race","What ethnicity do you identify as?",
                       c("African American/Black"="1","Hispanic/Latino"="2",
                         "Asian"="0","Middle Eastern"="-1","Pacific Islander"="3",
                         "Native American"="4","White"="-2","Other"="5"))
           
        
        
),
column(4, radioButtons("international","Are you a foreign national?",
                       c("No"=0,"Yes"=1)),
       radioButtons("firstinfamily","Are you the first in your family to attend university?",
                    c("No"=0,"Yes"=1)),
       
       radioButtons("sports","Do you play varsity athletics?",
                    c("No"=0,"Yes"=1))                   ),
       column(4, radioButtons("alum","Are you a legacy at this school?",
                              c("No"=0,"Yes"=1)),
              radioButtons("out","Are you applying from out of state?",
                           c("No"=0,"Yes"=1)),
              radioButtons("early","Are you applying early?",
                           c("No"=0,"Yes"=1))))),
fluidRow(    column(12,align="center",selectInput("college", "What college are you applying to?",
                    c("",as.character(sort(unique(dataunnormed$name))))),
              h2(uiOutput("headerText")),
              bsAlert("alert"),
              br(),
              h4(uiOutput("importancehelper")),
              plotOutput("importance"),
              br(),
              h4(uiOutput("scatterhelper")),
                conditionalPanel(condition="input.college!=''",wellPanel(
                  selectInput("xvar", "X-axis variable", better_labels, selected = "private high-school indicator"),
                  selectInput("yvar", "Y-axis variable", better_labels, selected = "GPA"),
                  tags$small(paste0(
                    "Choose different input variables to investigate how aspects of an application are related. For example, you might ask: Do private school students have better GPAs than public school students?"))),
              ggvisOutput("plot1")),
              
              br()))))

```

Finally, `server.R` is the user interface back end, which also invokes the webservice:
```
library(shiny)
library(shinyBS)
require(ggplot2)
require(reshape2)
require(plyr)
require(curl)
library(randomForest)
library(ggvis)
library(jsonlite)


# Define server logic
shinyServer(function(input, output, session) {

 
 
 observeEvent(input$goButton, {gotime<-1
  })
  
 
  
  # get text for header
  output$headerText <- renderUI({
    
    if (input$college!=""){
      
    #translate SAT and ACT to combined score
    if (as.numeric(input$act)==0 && as.numeric(input$sat)!=0)
    {at = as.numeric(input$sat)}
    else if (as.numeric(input$sat)==0 && as.numeric(input$act)!=0)
    {
      #convert act to sat
      at = act2sat$sat[act2sat$act==as.numeric(input$act)]}
    else
    {at = 0}
    #normalize against test data
    gpa_normed<-(as.numeric(input$gpa) - normedmeans$GPA)/normedstds$GPA
    at_normed<-(at - normedmeans$admissionstest)/normedstds$admissionstest
    apave_normed<-(as.numeric(input$apave) - normedmeans$averageAP)/normedstds$averageAP
    #sat2ave_normed<-(as.numeric(input$sat2ave) - normedmeans$SATsubject)/normedstds$SATsubject
    
    
    #process values into boolean
    if (as.integer(input$race)>0)
      {race = 1}
    else {race=0}
    if (as.integer(input$hs)>0)
    {hs = 1}
    else {hs=0}
    if (as.integer(input$gender)>0)
    {fem = 1}
    else {fem=0}
    if (input$apnum==0)
    {apave_normed=0}
    
    
    pred = data.frame(admissionstest=numeric(0),
                      AP=numeric(0),
                      averageAP=numeric(0),
                      SATsubject=numeric(0),
                      GPA=numeric(0),
                      schooltype=numeric(0),
                      female=numeric(0),
                      MinorityRace=numeric(0),
                      international=numeric(0),
                      sports=numeric(0),
                      earlyAppl=numeric(0),
                      alumni=numeric(0),
                      outofstate=numeric(0),
                      acceptrate=numeric(0),
                      size=numeric(0),
                      public=numeric(0))


    pred[1,] = list(at_normed,  as.numeric(input$apnum), apave_normed,   input$sat2ave,
                    gpa_normed,   hs,   fem,
                    race,   as.numeric(input$international),   as.numeric(input$sports),   as.numeric(input$early),
                    as.numeric(input$alum),   as.numeric(input$out),   
                    datanormed$acceptrate[datanormed$name==input$college][1],   
                    datanormed$size[datanormed$name==input$college][1],
                    datanormed$public[datanormed$name==input$college][1])
    
    
    # create query string
    qs = paste0(colnames(pred),"=",pred[1,],collapse="&")
    
    server = "https://boiling-forest-8250.herokuapp.com/predict"
    
    URL = paste0(server,"?",qs)
    
    js  = fromJSON(URL)
    df = js$preds
    df$college = as.factor(df$college)
 

    
    
    #report
    str1 <- paste("<p>Our algorithm predicts you have a<br>")
    str2 <- paste("percent chance of getting in to")
    str3 <- paste("<br><br>The 95% confidence interval on this prediction is")
    str4 <- paste("to")
    HTML(str1,100*df$prob[df$college==input$college],str2,input$college)
    }
       
     })
  
  #echo importances
  output$importance<- renderPlot({
    
    if (input$college!=""){
      
      createAlert(session, "alert", "Alert", title = "This prediction is not a guarantee of admission.", 
                  "We are simply interested in exploring how application factors affect college admissions. Our algorithm is 74% accurate.", append = FALSE);
      
      drops <- c("finAidPct","instatePct")
      newimportdf<-importdf[!(rownames(importdf) %in% drops),]
      
     
      
      
    ggplot(data=newimportdf,aes(y=MeanDecreaseGini,x=reorder,fill=MeanDecreaseGini))+
      geom_bar(stat="identity")+coord_flip()+
      ggtitle(paste('Importance of Application Components'))+
      xlab(paste('Application Component'))+
      ylab(paste('Effect Size'))+
      scale_x_discrete(labels=better_labels)+
      guides(fill=FALSE)+
      theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), 
            panel.background = element_blank(), axis.line = element_line(colour = "black"))
    }
    
    
  })
  
  output$importancehelper <-renderUI({
    if (input$college!=""){
    HTML("This plot shows the relative importance of various parts of the application. In other words, when we train a model to predict your chance of admission, it weights the different aspects of your application according to these importances. Notice that SAT and ACT scores play the largest role in determining your chances of acceptance.")
    }
      })
  
  output$scatterhelper <-renderUI({
    if (input$college!=""){
      HTML("This graph shows the relationship between an applicant's qualifications and their acceptance to",input$college,". Each dot represents a student in our data set; blue dots denote a student who was accepted.")
    }
  })
  
  
  #echo scatterplot
  vis<- reactive({

      # Labels for axes
      xvar_name <- colnames(subdata)[colnames(subdata) == input$xvar]
      yvar_name <- colnames(subdata)[colnames(subdata) == input$yvar]
      
      # Normally we could do something like props(x = ~GPA, y = ~SAT),
      # but since the inputs are strings, we need to do a little more work.
      xvar <- prop("x", as.symbol(input$xvar))
      yvar <- prop("y", as.symbol(input$yvar))
      
      if (input$college!='')
      {graphdata = dataunnormed[dataunnormed$name==input$college,]
      graphdata$acceptStatus[graphdata$acceptStatus==-1]<-0}
      else
      {graphdata = subdata
      graphdata$acceptStatus<-0}
      
      
      graphdata %>%
        ggvis(x = xvar, y = yvar) %>%
        layer_points(size := 50, size.hover := 200,
                     fillOpacity := 0.2, fillOpacity.hover := 0.5,
                     stroke = ~acceptStatus) %>%
        #layer_points(x = xvar, y= yvar, size := 50, fillOpacity=1,fill:="red", data = testdata)%>%
        add_axis("x", title = xvar_name) %>%
        add_axis("y", title = yvar_name) %>%
        add_legend("stroke", title = "Accepted", values = c("Yes", "No")) %>%
        #scale_nominal("stroke", domain = c("Yes", "No"),
                      #range = c("blue", "#aaa")) %>%
        set_options(width = 500, height = 500)
      
    
  })
  
  vis %>% bind_shiny("plot1")
  

  

#echo link to CS109
output$link <- renderUI({
   return(h4(p("Want to see other cool machine learning tools and projects? Visit",
                strong(a("the Harvard CS109 homepage.", href="http://cs109.github.io")))))
  

})



  ## PART THREE: alert
  output$alert <- renderUI({
   
    
        createAlert(session, "alert", "Alert", title = "This is not a guarantee of admission!", 
                    "This website is experimental and exploratory. Feel free to toggle the settings to see how your chances might change with an extra AP class or two.", append = FALSE);
     
  })

  
})

```
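
The test-score handling at the top of `server.R` is easy to miss inside the listing: a single admissions-test value is derived from whichever of the SAT or ACT sliders is non-zero, with ACT converted through the lookup table from `global.R`. The following Python sketch restates that logic for illustration. The names `ACT_TO_SAT` and `combined_test_score` are hypothetical, and returning 0 for an ACT score missing from the table is an assumption of this sketch rather than documented behavior of the R code.

```python
# SAT equivalents for ACT scores 36 down to 11, copied from global.R.
SAT_SCORES = [1600, 1560, 1510, 1460, 1420, 1380, 1340, 1300, 1260, 1220,
              1190, 1150, 1110, 1070, 1030, 990, 950, 910, 870, 830,
              790, 740, 690, 640, 590, 530]
ACT_TO_SAT = dict(zip(range(36, 10, -1), SAT_SCORES))


def combined_test_score(sat, act):
    """Return one admissions-test score from the two slider values.

    Mirrors the branches in server.R: use SAT if only SAT is set,
    convert ACT if only ACT is set, otherwise fall back to 0.
    """
    if act == 0 and sat != 0:
        return sat
    if sat == 0 and act != 0:
        # Assumption: ACT scores outside the table map to 0.
        return ACT_TO_SAT.get(act, 0)
    return 0
```

Note that when both sliders are non-zero the result is 0, which matches the final `else` branch in the R code.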

<a id="squarespace"></a>
## Part 3 - SquareSpace

SquareSpace hosts the static portion of the public facing web site. It also provides summary usage statistics.

<a id="references"></a>
## References

Getting started with Python on Heroku: https://devcenter.heroku.com/articles/getting-started-with-python-o#prerequisites

Buildpack for Conda on Heroku: https://github.com/kennethreitz/conda-buildpack

Getting started with OpenShift and Python 2.7 (without Flask): https://developers.openshift.com/en/python-getting-started.html

Getting started with OpenShift and Flask: https://developers.openshift.com/en/python-flask.html

Blog post about OpenShift and Flask: https://blog.openshift.com/day-3-flask-instant-python-web-development-with-python-and-openshift/

Somewhat dated: https://blog.openshift.com/beginners-guide-to-writing-flask-apps-on-openshift/