<a href="https://colab.research.google.com/github/jim-umich/jnorb-onguetou/blob/main/colab/streamlit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install streamlit



In [None]:
!pip install pyngrok



In [None]:
! pip install streamlit-folium



In [None]:
# from google.colab import drive
# import os
# drive.mount('/content/drive')

In [None]:
# os.chdir('drive/Shareddrives/CapstoneProject/AoT_Chicago Reduced Data_V3')

In [77]:
%%writefile app.py
import streamlit as st
import pandas as pd
import numpy as np

import gzip
from google.colab import drive
import os

from PIL import Image

import matplotlib.pyplot as plt
import altair as alt
# from streamlit_folium import folium_static
import folium
import seaborn as sb

from statsmodels.tsa.api import VAR
from statsmodels.tsa.vector_ar.var_model import VARResults, VARResultsWrapper
# statsmodels.tsa.vector_ar.var_model.VARProcess.forecast_interval

# Suppress warnings
import warnings
from statsmodels.tools.sm_exceptions import ValueWarning
warnings.simplefilter("ignore", ValueWarning)

# st.title("MADS 697/698 - CAPSTONE")
# st.header("Diane O and James N")


def main():
  # "# MADS CAPSTONE, SIADS 697-698"
  st.header("MADS CAPSTONE, SIADS 697-698")
  # st.markdown('\n\n')
  st.title("Covid-19 Lockdown and Pollution in the City of Chicago?")
  # st.markdown('\n\n')
  st.header("Diane O. and James N.")
  st.markdown('\n\n')

  # menu = ["Test 1", "Test 2"]
  # choice = st.sidebar.selectbox('Menu', menu)
  # if choice == 'Test 1':
  #     st.header("This is Test 1")

  st.markdown(
    """For our project, we decided to focus on the Chicago Array of Things dataset.  This massive dataset was collected between 2018 and 2020 from a node array installed throughout the city of Chicago.  Each node carries a variety of sensors, which fall into one of six categories: air quality, meteorological, physical, environmental, system, and vision.  For this analysis, we focus mainly on the air quality metrics.  These include concentrations of five gases (CO, H2S, NO2, O3, and SO2) as well as oxidizing and reducing gas concentrations. """)

  st.markdown('\n\n')
  st.header("Reducing the Data")
  st.markdown(
      """The data was initially collected and stored in a column-style database with seven columns: timestamp, node id, subsystem, sensor, parameter, raw value, and hrf value (the processed value).  Ideally, we would have hosted this data on a cloud server, either keeping the column format with Cassandra or a similar database or migrating to a relational database such as PostgreSQL.  However, the dataset was over 300 GB in size, which pushed the cost too high for our project.  We instead elected to write our own program to reduce the file to a manageable size. """
  )
  st.markdown(
      """After the archive was untarred to a gzip file, the first step was to split the data into separate files based on the node that collected the data.  This was done from the bash command line with the command below:"""
  )
  st.code(
      """zcat file.csv.gz | awk -F "," '{print > ("subfolder/" $2 ".csv")}'""", language="bash"
  )
  st.markdown(
      """This command decompresses the file line by line and pipes it to awk, which splits each line on commas and writes the line to a file named after the value in the second column (the node id).  This significantly reduced the size of each file we worked with, but with the largest files around 10 GB, we still had to reduce the data.  Data was collected every second at each node.  We did not need this level of granularity and decided to average the readings every hour.  Theoretically, this would reduce our data by a factor of about 3600.  We completed this reduction in C# due to its speed and support for parallel processing, but a similar pipeline could be implemented in Python. """
  )
  st.markdown(
      """The process we used is only possible because the data was already sorted by ascending date.  The entire code can be found on our GitHub page.  The process was to first assign a thread to read a file.  The timestamp of the first line in the file, rounded down to the most recent half-past-the-hour mark, becomes the start time (e.g., 4:15 gets a start time of 3:30 and 4:45 gets a start time of 4:30), and the end time is the start time plus one hour.  A dictionary was then created with the parameter, subsensor, and system as the key and the running total for each as the value.  Once the timestamp is greater than the end time or there are no more lines in the file to read, the thread takes the mean value for each key and appends it to a new csv file named with the node id.  The values are cleared, and the thread continues if there are more lines in the file or moves on to the next file.  The entire process takes about 30 minutes to run, and we were able to reduce the size of the entire dataset from 300 GB down to 1.8 GB.  This size is near the limits for pandas but proved to be manageable after filtering by metrics for each analysis."""
  )

  st.markdown("data from one node")
  sample_data = pd.read_csv('drive/Shareddrives/CapstoneProject/AoT_Chicago Reduced Data_V3/Numeric_001e06112e77.csv')
  st.dataframe(sample_data.head())

  st.markdown("master dataframe")
  master_df = pd.read_csv("drive/Shareddrives/CapstoneProject/AoT_Chicago Reduced Data_V3/others/cleaned_dataset.zip")
  st.dataframe(master_df.head())

  st.markdown('nodes')
  nodes = pd.read_csv("drive/Shareddrives/CapstoneProject/AoT_Chicago Reduced Data_V3/others/nodes.csv")
  st.dataframe(nodes.head())

  st.markdown('sensors')
  sensors = pd.read_csv('drive/Shareddrives/CapstoneProject/AoT_Chicago Reduced Data_V3/others/sensors.csv')
  st.dataframe(sensors.head())


  st.markdown('\n\n')
  st.header("Data Exploration")
  st.markdown(
      """Once the data was reduced to hourly readings, we were able to explore the data using pandas.  Our initial exploration provided us with locations of the sensors and the time frame that each sensor was active.  With this information, we decided to modify our initial plan of focusing on quality of life to focus instead on a causal inference model comparing how trends in air quality changed due to Covid related lockdowns in the city."""
  )
  st.markdown(
      """We also ran an initial clustering analysis of the nodes based on the types of sensors at each location. A sparse matrix for each node with the sensor types as the values was used.  The sensor types were limited to only the subsystem types that include air quality metrics.  The dendrogram below was produced using agglomerative clustering with a distance threshold of 15.  There is clear separation between the groups meaning that there are differences in the types of data collected at these nodes.  From the bar chart, we can see that there is a cluster group that does not have any of the air quality data that we are most concerned with.  The other two groups have all of them and appear to have good coverage around the city as shown in the map."""
  )


  st.markdown('node locations')
  # TODO: might need a static image for folium and refer back to the notebook...
  # latlon = list(zip(nodes_df['lat'], nodes_df['lon'], nodes_df['node_id']))
  # mapit = folium.Map( location=[41.85, -87.65], zoom_start=11 )

  # for coord in latlon:
  #   folium.Marker( location=[ coord[0], coord[1] ],
  #               tooltip=('node:', coord[2], 'lat:', coord[0], 'lon:', coord[1]),
  #               #  tooltip = ''
  #               popup=coord[2]).add_to( mapit )

  # folium.TileLayer('cartodbpositron').add_to(mapit)
  # folium_static(mapit)

  # import streamlit as st
  # from streamlit_folium import folium_static
  # import folium
  # "# streamlit-folium"

  # with st.echo():
  #     import streamlit as st
  #     from streamlit_folium import folium_static
  #     import folium

  #     # center on Liberty Bell
  #     m = folium.Map(location=[39.949610, -75.150282], zoom_start=16)

  #     # add marker for Liberty Bell
  #     tooltip = "Liberty Bell"
  #     folium.Marker(
  #         [39.949610, -75.150282], popup="Liberty Bell", tooltip=tooltip
  #     ).add_to(m)

  #     # call to render Folium map in Streamlit
  #     folium_static(m)


  st.markdown('\n\n')
  st.markdown('time for data collection')

  up_df = pd.DataFrame(columns=['node_id', 'start', 'end'])
  idx = 0
  for node in nodes.node_id:
    sample = master_df[master_df.node_id == node]
    up_df.loc[idx] = [node, pd.to_datetime(sample.date).min(), pd.to_datetime(sample.date).max()]
    idx += 1

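  # Convert explicitly so the .dt accessor works even if the columns were stored as object dtype.
  up_df['days_up'] = (pd.to_datetime(up_df.end).dt.normalize() - pd.to_datetime(up_df.start).dt.normalize()).dt.days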
  st.dataframe(up_df.head())

  st.markdown('time for data collection, description ...')
  st.dataframe(up_df.describe().T)  

  base = alt.Chart(up_df).encode(
      alt.X('node_id:N')
  ).properties(width = 1500)

  rule = base.mark_rule().encode(
      alt.Y('start:T', axis = alt.Axis(format='%m/%y', title='Date')), #,labelAngle=-45
      alt.Y2('end:T')
  )

  startpoints = base.mark_circle(size=60).encode(
      alt.Y('start:T'),
      # alt.Y2('end:T')
  )

  endpoints = base.mark_circle(size=60).encode(
      # alt.Y('start:T'),
      alt.Y('end:T'), color = alt.value("#FFAA00")
  )

  st.altair_chart(rule + startpoints + endpoints)



  st.markdown('\n\n')
  st.markdown('node sensor types')

  subsystem_types = master_df[['node_id', 'subsystem']].groupby(['node_id', 'subsystem']).count().reset_index()
  subsystem_types['count'] = 1


  sensor_chart = alt.Chart(subsystem_types).mark_tick().encode(
    x='node_id',
    y='subsystem',
    color='subsystem'
  ).properties(width=1400) #, height=250
  st.altair_chart(sensor_chart)


  st.markdown('\n\n')
  st.markdown('sensors')
  subsystem_sensor_types = master_df[['subsystem', 'sensor']].groupby(['subsystem', 'sensor']).count().reset_index()
  subsystem_sensor_types['count'] = 1

  subsystem_chart = alt.Chart(subsystem_sensor_types).mark_rect().encode(
    x='sensor',
    y='subsystem',
    # color='subsystem'
  ).properties(width=1400)
  st.altair_chart(subsystem_chart)

  filtered_subsystems = master_df[master_df['subsystem'].isin(['lightsense', 'metsense', 'chemsense', 'alphasense', 'plantower'])]
  subsystem_sensor_types = filtered_subsystems[['subsystem', 'sensor']].groupby(['subsystem', 'sensor']).count().reset_index()
  subsystem_sensor_types['count'] = 1

  filteredsub_chart = alt.Chart(subsystem_sensor_types).mark_rect().encode(
    x='sensor',
    y='subsystem',
    color='subsystem'
  ).properties(width=1400)
  st.altair_chart(filteredsub_chart)

  sensor_types_parameters = filtered_subsystems[['subsystem', 'sensor', 'parameters']].groupby(['subsystem', 'sensor', 'parameters']).count().reset_index()
  sensor_types_parameters['count'] = 1

  param_chart = alt.Chart(sensor_types_parameters).mark_rect().encode(
    x='parameters',
    y='sensor',
    color='subsystem'
  ).properties(width=1000, height=500)
  st.altair_chart(param_chart)

  st.dataframe(filtered_subsystems.head())


  pms = ['10um_particle', '1um_particle', '2_5um_particle', '5um_particle', 'pm1', 'pm10', 'pm10_atm', 'pm10_cf1', 'pm1_atm', 'pm1_cf1', 'pm25_atm', 'pm25_cf1', 'pm2_5', 'point_3um_particle', 'point_5um_particle', 'fw', 'sample_flow_rate', 'sampling_period']
  # 'concentration', 
  df_w_pms = filtered_subsystems[filtered_subsystems['parameters'].isin(pms) ].drop(['node_id', 'subsystem', 'sensor'], axis=1)

  df_w_pms = pd.pivot_table(df_w_pms, values='values', index='date', columns='parameters', aggfunc='mean').reset_index()
  # df_w_pms = df_w_pms.fillna(method="bfill")

  # df_w_pms
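  st.dataframe(df_w_pms.describe())

  # TODO: might need a static image for seaborn and refer back to the notebook...
  # plt.rc() returns None, so create the figure explicitly and hand it to Streamlit.
  fig, ax = plt.subplots(figsize=(25, 10))
  # Drop the non-numeric date column before computing correlations.
  sb.heatmap(df_w_pms.drop(columns='date').corr(method='pearson'), cmap='YlGnBu', annot=True, ax=ax)
  st.pyplot(fig)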


  st.markdown('\n\n')
  st.header("Causal Inference Analysis")
  st.markdown(
      """..."""
  )

  df = filtered_subsystems[filtered_subsystems['parameters'] == 'concentration'].drop(['node_id', 'subsystem', 'parameters'], axis=1)
  st.dataframe(df.head())

  df = pd.pivot_table(df, values='values', index='date', columns='sensor', aggfunc='mean').reset_index()
  # df = df.fillna(method="bfill")
  st.dataframe(df.head())
  st.dataframe(df.describe())

  df.date = pd.to_datetime(df.date)

  base = alt.Chart(df).mark_line().encode(x = 'date:T').properties(width=250, height=250)#.interactive()

  chart = alt.vconcat()

  row = alt.hconcat()
  for gas in ['co:Q', 'h2s:Q', 'no2:Q', 'o3:Q']:
    row |= base.encode(y=gas)
  chart &= row

  row = alt.hconcat()
  for gas in ['oxidizing_gases:Q', 'reducing_gases:Q', 'so2:Q']:
    row |= base.encode(y=gas)
  chart &= row

  st.altair_chart(chart)

  st.markdown('\n\n')
  st.header("Clustering Analysis")
  st.markdown(
      """Node clustering seemed like an obvious choice for analysis with similar data being collected around the city.  We were expecting to find some similarity between nodes for each metric that may indicate differences or similarities between neighborhoods that are not physically close to each other.  The focus of this analysis is on the air quality metrics based on the concentrations of selected gases.  This analysis does not explore what is “good” vs “bad” air quality; it instead focuses on identifying similar concentrations of gases over time.  The results of this analysis may be interesting to correlate with other metrics such as income maps, housing prices, zoning (e.g., business vs. residential areas), and health maps."""
  )
  AggloPicture = Image.open('drive/Shareddrives/CapstoneProject/AoT_Chicago Reduced Data_V3/others/AgglomerativeClusters.jpg')
  st.image(AggloPicture, caption='Agglomerative Clusters')
  st.markdown('')

  clustered_data = pd.read_csv('drive/Shareddrives/CapstoneProject/AoT_Chicago Reduced Data_V3/others/clustered_dataset.csv')
  # st.dataframe(clustered_data.head())
  def cluster_timeseries(data):
    """Build a chart grid: one row per clustering method, one column per gas."""
    cols = ['concentration_co', 'concentration_h2s', 'concentration_no2', 'concentration_o3', 'concentration_oxidizing_gases', 'concentration_reducing_gases', 'concentration_so2']
    methods = ['Agglomerative', 'DBSCAN', 'OPTICS', 'Spectral', 'AffinityPropagation']
    column = None
    for method in methods:
      row = None
      for col in cols:
        # Mean concentration per date and cluster label. Selecting the column
        # before .mean() avoids averaging unrelated (possibly non-numeric) columns.
        df = data.groupby(['date', method])[col].mean().reset_index()
        c = alt.Chart(df).mark_line().encode(
            x=alt.X('date:T', title=''),
            y=f'{col}:Q',
            color=f'{method}:N'
        ).properties(width=200)
        row = c if row is None else row | c
      # Each method has its own cluster labels, so each row gets its own color scale.
      column = row if column is None else (column & row).resolve_scale(color='independent')
    return column

  clustered_chart = cluster_timeseries(clustered_data)
  st.altair_chart(clustered_chart)

  st.subheader("Data Representation")
  st.markdown(
      """Two representations of the data were used to compare results.  The first representation averages the data over each day of the week.  The motivation for this comes from the assumption that there are differences in behavior on each weekday that may influence concentrations of gases in the air, such as differences between workday and weekend traffic (shape = (number of nodes, 7 x parameters)).  The second representation averages all the values by calendar date, combining the same date across years (shape = (number of nodes, 366 x parameters)).  Neither representation includes data collected during the lockdowns, to prevent any lockdown-related fluctuations from skewing the results. """
  )

  st.subheader("Results")
  st.markdown(
      """The results of this analysis were not as consistent as we had hoped.  The most reliable results came from the Agglomerative clustering with the date averaging data representation.  The average time series for each cluster show clear differences between the cluster groups.  It is not clear, however, how these groups were separated.  In the map, we can see that most of the nodes belong to a single cluster.  The second largest cluster, shown in blue, appears to be much noisier than the largest orange group.  The smallest clusters appear to be the least consistent sensors and have many data gaps or were only active for a short period of time.  It appears that the clusters may have separated only on measurement consistency.  Further analysis is required to understand why these nodes are producing noisier data.  It is unclear if this is due to a faulty sensor or if there are local sources of pollution that cause the spikes.  Future projects could compare these clusters to traffic patterns and industrial centers that could be pollution sources.  It would also be beneficial to include more variables such as wind speed and direction that have an impact on the direction of pollution plume movement.  """
  )

  

  st.markdown(
      """ """
  )

if __name__ == '__main__':
  main()



Overwriting app.py
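
The "Reducing the Data" text in app.py says the hourly averaging was done in C# but that a similar pipeline could be implemented in Python. The following is a minimal pandas sketch of that idea, not the authors' C# program. The column names in `COLUMNS`, the function name `hourly_means`, the choice to average the hrf value, and the use of fixed half-past-to-half-past windows are all assumptions made for illustration.

```python
import pandas as pd

# Assumed column names for the seven raw columns described in the app text.
COLUMNS = ["timestamp", "node_id", "subsystem", "sensor",
           "parameter", "value_raw", "value_hrf"]


def hourly_means(df):
    """Average value_hrf per node, sensor key, and one-hour window.

    Windows are assumed to run from half past to half past, so 4:15 falls
    in the window starting at 3:30 and 4:45 in the one starting at 4:30.
    """
    ts = pd.to_datetime(df["timestamp"])
    half_hour = pd.Timedelta(minutes=30)
    # Shift by 30 minutes, floor to the hour, then shift back.
    window_start = (ts + half_hour).dt.floor("h") - half_hour
    # Non-numeric readings become NaN and are skipped by mean().
    values = pd.to_numeric(df["value_hrf"], errors="coerce")
    keys = ["node_id", "window_start", "subsystem", "sensor", "parameter"]
    return (
        df.assign(window_start=window_start, value_hrf=values)
          .groupby(keys)["value_hrf"]
          .mean()
          .reset_index()
    )
```

For files too large to load at once, the same function could be applied per chunk from `pd.read_csv(..., chunksize=...)`, but windows that straddle a chunk boundary would then need to be merged afterwards; that step is not shown here.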


In [None]:
!ls drive/Shareddrives/CapstoneProject/'AoT_Chicago Reduced Data_V3'/others

cleaned_dataset.csv  data.csv.gz  nodes.csv
ClusterGroups.csv    file.csv.gz  sensors.csv
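
The listing above includes file.csv.gz, the input named in the zcat/awk one-liner quoted in app.py. As a rough Python counterpart to that one-liner, the sketch below streams a gzipped CSV and writes each row to a file named after its second field. The function name `split_by_node` and its parameters are hypothetical. Like the awk command, it treats every line the same way, so a header row (if present) ends up in a file named after its own second field.

```python
import csv
import gzip
import os
import tempfile


def split_by_node(gz_path, out_dir, node_col=1):
    """Write each row of a gzipped CSV to <out_dir>/<node id>.csv.

    node_col is the zero-based index of the node id column (the second
    column by default). Returns the sorted list of node ids seen.
    """
    os.makedirs(out_dir, exist_ok=True)
    handles = {}  # node id -> (open file, csv writer)
    try:
        with gzip.open(gz_path, "rt", newline="") as src:
            for row in csv.reader(src):
                node_id = row[node_col]
                if node_id not in handles:
                    # One handle per node stays open for the whole pass.
                    path = os.path.join(out_dir, f"{node_id}.csv")
                    f = open(path, "w", newline="")
                    handles[node_id] = (f, csv.writer(f, lineterminator="\n"))
                handles[node_id][1].writerow(row)
    finally:
        for f, _ in handles.values():
            f.close()
    return sorted(handles)
```

A call such as `split_by_node("file.csv.gz", "subfolder")` would mirror the paths used in the shell command. The file is read as a stream, so memory use stays small, but one file handle per node is kept open until the pass finishes.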


In [None]:
! ls sample_data

anscombe.json		      mnist_test.csv
california_housing_test.csv   mnist_train_small.csv
california_housing_train.csv  README.md


In [None]:
! ngrok authtoken YOUR_NGROK_AUTHTOKEN  # placeholder: supply your own token

Authtoken saved to configuration file: /root/.ngrok2/ngrok.yml


In [None]:
! ngrok

NAME:
   ngrok - tunnel local ports to public URLs and inspect traffic

DESCRIPTION:
    ngrok exposes local networked services behinds NATs and firewalls to the
    public internet over a secure tunnel. Share local websites, build/test
    webhook consumers and self-host personal services.
    Detailed help for each command is available with 'ngrok help <command>'.
    Open http://localhost:4040 for ngrok's web interface to inspect traffic.

EXAMPLES:
    ngrok http 80                    # secure public URL for port 80 web server
    ngrok http -subdomain=baz 8080   # port 8080 available at baz.ngrok.io
    ngrok http foo.dev:80            # tunnel to host:port instead of localhost
    ngrok http https://localhost     # expose a local https server
    ngrok tcp 22                     # tunnel arbitrary TCP traffic to port 22
    ngrok tls -hostname=foo.com 443  # TLS traffic for foo.com to port 443
    ngrok start foo bar baz          # start tunnels from the configuration file

VERSI

In [None]:
from pyngrok import ngrok

In [None]:
# !nohup streamlit run app.py
# ! streamlit run app.py &>/dev/null&
! streamlit run --server.port 80 app.py >&/dev/null&

In [None]:
! pgrep streamlit

3745
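
`pgrep` confirms that a Streamlit process exists, but not which port it is listening on, and a tunnel only works if it points at that port. The helper below is a small standard-library sketch for checking that something accepts connections on a given local port before opening a tunnel. The name `wait_for_port` and its defaults are hypothetical and not part of Streamlit or pyngrok.

```python
import socket
import time


def wait_for_port(port, host="127.0.0.1", timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds.

    Retries every `interval` seconds and returns False if nothing is
    listening by the time `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait and try again.
            time.sleep(interval)
    return False
```

With the server started on port 80 as in the cell above, `wait_for_port(80)` would be the matching check; for Streamlit's default port it would be `wait_for_port(8501)`.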


In [None]:
public_url = ngrok.connect("80")  # Streamlit was started on port 80 above

In [None]:
public_url

<NgrokTunnel: "http://bb6c-34-83-236-207.ngrok.io" -> "http://localhost:80">

In [None]:
! kill 2516

/bin/bash: line 0: kill: (2516) - No such process


In [None]:
ngrok.kill()  # pyngrok call (not a shell command): stops the ngrok process started from this notebook


In [None]:
from pyngrok import ngrok

# Set up a tunnel to the port Streamlit is serving on (80, as started above)
public_url = ngrok.connect("80")
public_url

2021-12-18 19:47:39.220 INFO    pyngrok.ngrok: Opening tunnel named: http-80-bd04b993-a163-412d-8240-9c00f1b2ca91




2021-12-18 19:47:40.068 INFO    pyngrok.process.ngrok: t=2021-12-18T19:47:40+0000 lvl=info msg="no configuration paths supplied"
2021-12-18 19:47:40.076 INFO    pyngrok.process.ngrok: t=2021-12-18T19:47:40+0000 lvl=info msg="using configuration at default config path" path=/root/.ngrok2/ngrok.yml
2021-12-18 19:47:40.079 INFO    pyngrok.process.ngrok: t=2021-12-18T19:47:40+0000 lvl=info msg="open config file" path=/root/.ngrok2/ngrok.yml err=nil
2021-12-18 19:47:40.083 INFO    pyngrok.process.ngrok: t=2021-12-18T19:47:40+0000 lvl=info msg="starting web service" obj=web addr=127.0.0.1:4040
2021-12-18 19:47:40.135 INFO    pyngrok.process.ngrok: t=2021-12-18T19:47:40+0000 lvl=info msg="tunnel session started" obj=tunnels.session
2021-12-18 19:47:40.137 INFO    pyngrok.process.ngrok: t=2021-12-18T19:47:40+0000 lvl=info msg="client session established" obj=csess id=f647aea08f9e
2021-12-18 19:47:40.146 INFO    pyngrok.process.ngrok: t=2021-12-18T19:47:40+0000 lvl=info msg=start pg=/api/tunnel

<NgrokTunnel: "http://672b-34-86-141-164.ngrok.io" -> "http://localhost:80">

2021-12-18 19:47:40.211 INFO    pyngrok.process.ngrok: t=2021-12-18T19:47:40+0000 lvl=info msg=end pg=/api/tunnels id=9028ea832f00b720 status=201 dur=48.988445ms
2021-12-18 19:47:40.219 INFO    pyngrok.process.ngrok: t=2021-12-18T19:47:40+0000 lvl=info msg=start pg="/api/tunnels/http-80-bd04b993-a163-412d-8240-9c00f1b2ca91 (http)" id=aface67a963d4893
2021-12-18 19:47:40.224 INFO    pyngrok.process.ngrok: t=2021-12-18T19:47:40+0000 lvl=info msg=end pg="/api/tunnels/http-80-bd04b993-a163-412d-8240-9c00f1b2ca91 (http)" id=aface67a963d4893 status=200 dur=189.591µs


In [None]:
!npx localtunnel --port 8501

In [None]:
!npm install -g localtunnel

In [None]:
!streamlit run app.py & npx localtunnel --port 8501

npx: installed 22 in 2.566s

  You can now view your Streamlit app in your browser.

  Network URL: http://172.28.0.2:8501
  External URL: http://34.86.141.164:8501

your url is: https://lovely-lionfish-42.loca.lt
  Stopping...
^C


NameError: ignored