<a href="https://colab.research.google.com/github/mwahaha-umich/ACLUFinalProject/blob/main/Copy_of_Kyle_Large_Dataset_Test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This Jupyter notebook is intended to help you access Plotly from outside of a Google Colab environment (it should technically work for any cloud environment that allows you to install NGROK). The problem with running Plotly within Google Colab is that the ports opened by Plotly are not exposed to the outside world. To expose them we are going to create a network tunnel, which is where NGROK comes in.

A few things of note:
*  If you are using Google Drive for any of the data behind this plot, you will want to copy that data "locally" before the graph will function externally. This is because we are starting up a new thread that will not carry over the Google Drive permissions you may have granted to the current thread.
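
One way to make that local copy is with the standard library's `shutil`. The sketch below is only an illustration: the helper name `copy_to_local` and the `/content/local_data` destination are assumptions, not part of the original notebook.

```python
import shutil
from pathlib import Path


def copy_to_local(source_file, local_dir="/content/local_data"):
    """Copy one file into a local folder and return the new path.

    `local_dir` is a hypothetical destination chosen for this sketch.
    """
    destination = Path(local_dir)
    # Create the local folder if it does not exist yet
    destination.mkdir(parents=True, exist_ok=True)
    # copy2 also preserves timestamps where the platform allows it
    return Path(shutil.copy2(source_file, destination))
```

Calling it once per data file after mounting Drive, and then reading from the returned path, should keep the app independent of the Drive mount.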


# Download and install NGROK

The first thing we need to do is download and install NGROK.

Important: if you ever get "OSError: [Errno 98] Address already in use", an instance of Dash is still running. The simplest solution is to restart the runtime.
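
Before restarting, it can help to confirm that something really is listening on the port. The following is a minimal sketch using only the standard library; the helper name `port_in_use` is an assumption for illustration, and 8050 is the port used by the tunnel command in this notebook.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1)
        # connect_ex returns 0 when the connection succeeds
        return sock.connect_ex((host, port)) == 0
```

If `port_in_use(8050)` returns `True` before you start the app, a previous Dash instance is probably still running.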

In [1]:
# How to run a Dash app in Google Colab

## Requirements

### Install ngrok
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip -o ngrok-stable-linux-amd64.zip


get_ipython().system_raw('./ngrok http 8050 &')

### Install Dash
!pip install dash  # The core dash backend
!pip install jupyter_dash

#Do not install any of the packages below. Installing Dash already includes them, and installing them separately pulls in older versions, which generates an error.
#!pip install dash-html-components  # HTML components
#!pip install dash-core-components  # Supercharged components
#!pip install dash-table  # Interactive DataTable component (new!)

--2021-07-23 12:19:56--  https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
Resolving bin.equinox.io (bin.equinox.io)... 52.45.159.115, 3.212.203.64, 3.210.213.176, ...
Connecting to bin.equinox.io (bin.equinox.io)|52.45.159.115|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13832437 (13M) [application/octet-stream]
Saving to: ‘ngrok-stable-linux-amd64.zip’


2021-07-23 12:19:59 (5.96 MB/s) - ‘ngrok-stable-linux-amd64.zip’ saved [13832437/13832437]

Archive:  ngrok-stable-linux-amd64.zip
  inflating: ngrok                   
Collecting dash
  Downloading dash-1.21.0.tar.gz (1.1 MB)
Collecting flask-compress
  Downloading Flask_Compress-1.10.1-py3-none-any.whl (7.9 kB)
Collecting dash-core-components==1.17.1
  Downloading dash_core_components-1.17.1.tar.gz (3.7 MB)
Collecting dash-html-components==1.1.4
  Downloading das

Now that we have NGROK configured, we need a way to find the public URL that NGROK assigned to the tunnel. To do this we are going to execute some Linux commands that list the established tunnels.

Additional Details:

During testing we discovered that the curl command would occasionally come back with nothing. We suspect this is because the command runs asynchronously and requests the list of tunnels before any of them have been established. So, we have built in a wait and a couple of loops to ensure that we get a proper URL.

Additionally, every time you restart the app you will need to generate a new tunnel, so make sure this method gets called before you execute your Python file to start the server.

In [2]:
import os
import json
import time


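#Assumption: LinuxOSList was never defined in this notebook; os.name is 'posix' on Linux (including Colab)
LinuxOSList = ['posix']


def GetPublicIP(NumberOfLoops):
  #This code only works on a linux based OS, we'll skip running it if we're on an unknown OS
  if os.name not in LinuxOSList:
    return None
  NumberOfLoops = NumberOfLoops + 1
  get_ipython().system_raw('./ngrok http 8050 &')
  time.sleep(1)
  result = os.popen("curl -s http://localhost:4040/api/tunnels").read()
  #curl returns an empty string when the ngrok API is not up yet, so guard the JSON parsing
  try:
    resultDict = json.loads(result) if result else {}
  except json.JSONDecodeError:
    resultDict = {}
  #Only read the URL once at least one tunnel is listed
  if resultDict.get('tunnels'):
    publicIP = resultDict['tunnels'][0]['public_url']
    print(publicIP)
    return publicIP
  #NumberOfLoops is an int, so convert it before concatenating
  print("Number of Loops: " + str(NumberOfLoops))
  if NumberOfLoops > 5:
    return "Unable to get public IP"
  return GetPublicIP(NumberOfLoops)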
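
The wait-and-retry idea described above can also be written as a plain loop that separates fetching from parsing. This is a sketch under stated assumptions, not the notebook's method: `parse_public_url` and `wait_for_public_url` are hypothetical names, and `fetch` is any callable that returns the response text of the tunnels endpoint.

```python
import json
import time


def parse_public_url(payload):
    """Return the first tunnel's public_url from the response text, or None."""
    try:
        tunnels = json.loads(payload).get("tunnels", [])
    except (json.JSONDecodeError, AttributeError, TypeError):
        # Empty or unexpected responses simply mean "not ready yet"
        return None
    return tunnels[0]["public_url"] if tunnels else None


def wait_for_public_url(fetch, attempts=5, delay=1.0):
    """Call fetch() until it yields a tunnel URL or the attempts run out."""
    for _ in range(attempts):
        url = parse_public_url(fetch())
        if url is not None:
            return url
        time.sleep(delay)
    return None
```

In this notebook, `fetch` could be `lambda: os.popen("curl -s http://localhost:4040/api/tunnels").read()`, the same command used in the cell above. Keeping the parsing separate makes the empty-response case easy to check without a running tunnel.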


# Connect to Google Drive

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
import os
from os import listdir
from os.path import isfile, join
from pathlib import Path
import requests

if 'google.colab' in str(get_ipython()):
  #We keep changing the different paths, so I'm going to check if various configurations exist.
  if (os.path.isdir("/content/drive/MyDrive/Shared with me/content/drive/My Drive/Shared with me/ACLU/")):
    GoogleDriveBase = "/content/drive/MyDrive/Shared with me/content/drive/My Drive/Shared with me/" #Anupriya
    WorkingDirectory = GoogleDriveBase + 'ACLU/' #Anupriya
  elif (os.path.isdir("/content/drive/My Drive/Projects/ACLU/")):
    GoogleDriveBase = "/content/drive/My Drive/" #Kyle
    WorkingDirectory = GoogleDriveBase + "Projects/ACLU/" #Kyle
  else:
    #Fail early with a clear message instead of a NameError further down.
    raise FileNotFoundError("No known ACLU folder was found in Google Drive; add your path to the checks above.")
else: # We're not running in Google Colab, which means we're probably running locally. 
  #Put code here for local copies of the files
  GoogleDriveBase = "" 
  WorkingDirectory = GoogleDriveBase + "" 


WorkingFiles = WorkingDirectory + 'WorkingFiles/'
#WorkingDirectory already starts with GoogleDriveBase, so the base must not be prepended again
BasePickeDrive = WorkingDirectory + "Pickle/"


#Make the necessary folders for the script to run.
ListOfAllRequiredDirectories = [WorkingDirectory + 'Pickle', 
                          WorkingDirectory + 'WorkingFiles',
                          WorkingDirectory + 'AdditionalData',
                          WorkingDirectory + 'ACLUData']

for folder in ListOfAllRequiredDirectories:
  #Each entry already contains the full path (GoogleDriveBase is part of WorkingDirectory)
  RunningPath = folder + "/"
  Path(RunningPath).mkdir(parents=True, exist_ok=True)
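
The chain of `os.path.isdir` checks above could also be expressed as a search over a list of candidate folders, which makes it easier to add another collaborator's path. A minimal sketch, assuming a hypothetical helper named `first_existing_dir`:

```python
from pathlib import Path


def first_existing_dir(candidates):
    """Return the first candidate that is an existing directory, else None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_dir():
            return path
    return None
```

The candidate list would hold the Drive folders tested above, and a `None` result signals that none of the known configurations is present.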

If you would like to run the previous example externally, just copy and paste your code into a new block like below. 

In [5]:
import pandas as pd

df = pd.read_csv(WorkingFiles + "dfVoters.csv")



In [6]:
dfVoters_sub = df[['Voter File VANID', 'FirstName', 'MiddleName',
       'LastName', 'DWID', 'Zip5', 'Age', 
       'MaritalStatus', 'Sex', 'City','State', 'Zip4', 'Suffix', 'CD', 'SD', 'HD', 'CountyName', 'DOB', 'DateReg',
       'EthnicCatalistName', 'Party', 'RaceName', 'Primary19', '2020_Biden_Support', 'Voting_Aug_Prim',
       'PoliceAccountability','VBM_Application', 'MarijuanaConviction','Absentee_Voting', 'Mass_Incarceration']]

bins= [18,29,44,64, 123]
labels = ['18-29','30-44','45-64','65+']
#include_lowest=True keeps voters who are exactly 18 in the '18-29' group
dfVoters_sub['AgeGroup'] = pd.cut(dfVoters_sub['Age'], bins=bins, labels=labels, right=True, include_lowest=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':
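
The warning above appears because `dfVoters_sub` is a selection taken from `df`, and pandas cannot tell whether the new column is meant for the selection or the original. One common way to avoid it is to take an explicit `.copy()` first. The sketch below uses invented toy data (`df_demo` is an assumption, not the voter file) and also passes `include_lowest=True`, since with `right=True` alone the first interval is (18, 29] and an age of exactly 18 would fall outside it.

```python
import pandas as pd

# Toy data standing in for the voter file (values invented for this sketch)
df_demo = pd.DataFrame({"Age": [18, 29, 30, 64, 65, None],
                        "Sex": ["F", "M", "F", "M", "F", "U"],
                        "Other": list(range(6))})

# .copy() makes the subset an independent DataFrame, so adding a column
# to it does not trigger SettingWithCopyWarning
subset = df_demo[["Age", "Sex"]].copy()

bins = [18, 29, 44, 64, 123]
labels = ["18-29", "30-44", "45-64", "65+"]
# include_lowest=True keeps age 18 in the first bin
subset["AgeGroup"] = pd.cut(subset["Age"], bins=bins, labels=labels,
                            right=True, include_lowest=True)
print(subset)
```

The missing age stays missing in `AgeGroup`, and the original `df_demo` is left without the new column.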


In [7]:
dfOutreach = pd.read_csv(WorkingFiles + "dfOutreach.csv")
dfOutreach = dfOutreach[['Voter File VANID', 'ElectionType','CommunicationType', 'File']]
outreach_voters = pd.merge(dfOutreach, dfVoters_sub, how='left', on='Voter File VANID')
outreach_dem_df = outreach_voters[['Voter File VANID', 'ElectionType', 'CommunicationType',
                                   'CountyName', 'AgeGroup', 'RaceName', 'Sex']]
outreach_dem_df.head()

Unnamed: 0,Voter File VANID,ElectionType,CommunicationType,CountyName,AgeGroup,RaceName,Sex
0,82502,scomi,text,Wayne,65+,Caucasian,F
1,85510,scomi,text,Wayne,45-64,Black,M
2,71010,scomi,text,Wayne,65+,Caucasian,F
3,44565,scomi,text,Wayne,65+,Caucasian,M
4,128077,scomi,text,Wayne,65+,Caucasian,M


In [16]:
start_coords = (46.9540700, 142.7360300)
folium_map = folium.Map(location=start_coords, zoom_start=14)
folium_map._repr_html_()



# Build our Dashboard

In [9]:
#Set up all our filter columns
FilterColumnNames = ["RaceName", "Sex", "AgeGroup", "CountyName"]
FilterColumns = {}
import math
import numpy as np

for column in FilterColumnNames:
  FilterColumns[column] = list(outreach_dem_df[column].astype("str").replace(np.nan, 'nan', regex=True).unique())
  FilterColumns[column].append("All")
  print(FilterColumns[column])

ImportantColumnNames = FilterColumnNames.copy()
#extend adds the two names individually; append would add the whole list as a single element
ImportantColumnNames.extend(['CommunicationType', 'ElectionType'])

outreach_dem_df_nan = outreach_dem_df.copy()
outreach_dem_df_nan["AgeGroup"] = outreach_dem_df_nan["AgeGroup"].astype("str")
outreach_dem_df_nan.head()

outreach_dem_df_nan = outreach_dem_df_nan.replace(np.nan, 'nan', regex=True)


df_aggregate = outreach_dem_df_nan.groupby(["RaceName", "Sex", "AgeGroup", "CountyName", 
                                        'CommunicationType', 'ElectionType']).count().reset_index()
df_aggregate = df_aggregate.rename(columns={"Voter File VANID": "Count"})

['Caucasian', 'Black', 'Unknown', 'Hispanic', 'Asian', 'Native American', 'nan', 'All']
['F', 'M', 'U', 'nan', 'All']
['65+', '45-64', '18-29', '30-44', 'nan', 'All']
['Wayne', 'Macomb', 'Oakland', 'Genesee', 'Washtenaw', 'Crawford', 'Berrien', 'Benzie', 'Jackson', 'Presque Isle', 'Marquette', 'Houghton', 'Ingham', 'Kent', 'Oscoda', 'Saginaw', 'Bay', 'Isabella', 'Lapeer', 'Clare', 'Eaton', 'Mason', 'Menominee', 'Dickinson', 'Clinton', 'Calhoun', 'St. Clair', 'Iosco', 'Ottawa', 'Chippewa', 'Monroe', 'Iron', 'Arenac', 'Wexford', 'Emmet', 'Gladwin', 'Allegan', 'Kalamazoo', 'Missaukee', 'Otsego', 'Mackinac', 'Delta', 'Hillsdale', 'Newaygo', 'Livingston', 'Branch', 'Lenawee', 'Osceola', 'Cheboygan', 'Sanilac', 'Roscommon', 'Van Buren', 'Shiawassee', 'Schoolcraft', 'Tuscola', 'Montcalm', 'Grand Traverse', 'Muskegon', 'Ogemaw', 'Leelanau', 'Alcona', 'St. Joseph', 'Midland', 'Mecosta', 'Charlevoix', 'Manistee', 'Huron', 'Cass', 'Alpena', 'Barry', 'Ionia', 'Gratiot', 'Montmorency', 'Antrim', 'O

In [10]:
#Pre aggregate the data as much as possible. Even if this only reduces your data by half it will provide a massive improvement to performance
print(str(len(df)) + " records reduce to " + str(len(df_aggregate)))
df_aggregate.head()

149854 records reduce to 3627


Unnamed: 0,RaceName,Sex,AgeGroup,CountyName,CommunicationType,ElectionType,Count
0,Asian,F,18-29,Calhoun,mail,lpv,1
1,Asian,F,18-29,Calhoun,text,lpv,7
2,Asian,F,18-29,Calhoun,text,scomi,2
3,Asian,F,18-29,Genesee,text,scomi,2
4,Asian,F,18-29,Ingham,mail,lpv,6
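
To see why pre-aggregation shrinks the data, here is a small standard-library sketch of the same idea: identical combinations of category values collapse into one row with a count. The records are invented for illustration and are not taken from the outreach data.

```python
from collections import Counter

# Invented records for illustration: (CommunicationType, ElectionType)
records = [("text", "scomi"), ("text", "scomi"), ("mail", "lpv"),
           ("text", "lpv"), ("text", "scomi"), ("mail", "lpv")]

# Count identical combinations, similar in spirit to groupby(...).count()
aggregate = Counter(records)

print(str(len(records)) + " records reduce to " + str(len(aggregate)))
for (communication, election), count in sorted(aggregate.items()):
    print(communication, election, count)
```

The number of aggregated rows depends only on how many distinct combinations exist, which is why adding more filter columns makes the aggregate larger.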


In [28]:
import altair as alt
import io
import dash
import dash_core_components as dcc
import dash_html_components as html
import plotly.graph_objs as go
from jupyter_dash import JupyterDash
import plotly.express as px
from vega_datasets import data
import folium
#alt.data_transformers.disable_max_rows()

# Don't need this with the cars dataset
alt.data_transformers.enable('default', max_rows=150000)

cars = data.cars()

external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']
app = JupyterDash(__name__, external_stylesheets=external_stylesheets)
app.css.append_css({'external_url':
                    'https://cdn.rawgit.com/gschivley/8040fc3c7e11d2a4e7f0589ffc829a02/raw/fe763af6be3fc79eca341b04cd641124de6f6f0d/dash.css'
                    })
app.title = 'Test dash and altair'
server = app.server

#Build our dropdowns for each of the Filter Columns
DropDowns = []
inputs = []
for ColumnName in FilterColumnNames:
    DropDowns.append(
        html.Div([
            html.Label(ColumnName),
            dcc.Dropdown(
                id=ColumnName,
                options=[{'label': i, 'value':i} for i in FilterColumns[ColumnName]],
                value='All'
            )
        ],
        style={'width': '250px', 'marginRight': 'auto',
              'marginLeft': 'auto', 'textAlign': 'center'},
        className='column'))
    inputs.append(dash.dependencies.Input(ColumnName, 'value'))

#Build our main layout
app.layout = html.Div(children=[html.Div(children=DropDowns, className='row'), html.Pre(id='plot', children="If this doesn't go away you have an error")])

@app.callback(
    dash.dependencies.Output('plot', 'children'),
    inputs
)
def pick_figure(Race, Sex, AgeGroup, CountyName):
    #This is our key. Filter everything before passing it into the charts.
    dfFiltered = df_aggregate[((df_aggregate['RaceName'] == Race) | (Race == 'All') ) & 
                              ((df_aggregate['Sex'] == Sex) | (Sex == 'All')) &
                              ((df_aggregate['AgeGroup'] == AgeGroup) | (AgeGroup == 'All')) &
                              ((df_aggregate['CountyName'] == CountyName) | (CountyName == 'All'))]

    #We should also aggregate it. This will keep our record count low and avoid errors. We will obviously need to add columns that we need.
    #dfFiltered = dfFiltered.groupby(by=['2020_Biden_Support', 'CountyName']).sum()
    #dfFiltered = dfFiltered.reset_index()

    #Encode the Altair Chart
    base = alt.Chart(dfFiltered).mark_bar().encode(
    x='Count:Q',
    y = alt.Y('CommunicationType:N', title=None),
    color='CommunicationType:N',
    row='ElectionType:N').properties(title='Election by Communication Type')

    # Save html as a StringIO object in memory
    htmloutput = io.StringIO()
    base.save(htmloutput, 'html')

    chart1 = html.Iframe(
        height='500',
        width='1000',
        #sandbox='allow-scripts',

        # This is where we will pass the html
        srcDoc = htmloutput.getvalue(),

        # Get rid of the border box
        style={'borderWidth': '0px'}
    )

    #Create a simple Plotly Express chart

    chart2 = dcc.Graph(
      id='example-graph',
      figure=px.histogram(
          dfFiltered, 
          x="CommunicationType", 
          y="Count", 
          histfunc='sum'),
      style={'width': '100vh', 'height': '50vh'}
    )

    #Add a folium map
    start_coords = (46.9540700, 142.7360300)
    folium_map = folium.Map(location=start_coords, zoom_start=14)
    chart3 = html.Iframe(
        height='500',
        width='1000',

        # This is where we will pass the html
        srcDoc = folium_map._repr_html_(),

        # Get rid of the border box
        style={'borderWidth': '0px'}
    )
    # Return the html from StringIO object
    return chart2, chart1, chart3
    #return chart2

#We can only have 1 instance of Dash running at a time.
app.run_server(mode="inline")


You have set your config to `serve_locally=True` but A local version of https://cdn.rawgit.com/gschivley/8040fc3c7e11d2a4e7f0589ffc829a02/raw/fe763af6be3fc79eca341b04cd641124de6f6f0d/dash.css is not available.
If you added this file with `app.scripts.append_script` or `app.css.append_css`, use `external_scripts` or `external_stylesheets` instead.
See https://dash.plot.com/external-resources



<IPython.core.display.Javascript object>
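
The filter inside `pick_figure` treats the dropdown value 'All' as "do not filter on this column". The same rule can be sketched without pandas, which may make the logic easier to follow. The helper name `matches` and the sample rows are assumptions made for this illustration.

```python
def matches(row, selections):
    """True if the row satisfies every dropdown selection; 'All' matches anything."""
    return all(choice == "All" or row[column] == choice
               for column, choice in selections.items())


rows = [
    {"RaceName": "Asian", "Sex": "F", "Count": 7},
    {"RaceName": "Black", "Sex": "M", "Count": 2},
    {"RaceName": "Asian", "Sex": "M", "Count": 1},
]

# Keep only the Asian rows, with no restriction on Sex
filtered = [row for row in rows if matches(row, {"RaceName": "Asian", "Sex": "All"})]
print(sum(row["Count"] for row in filtered))
```

Because every column defaults to 'All', the dashboard starts out showing the full aggregate and narrows only as selections are made.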