# Visualization 3 continued

# Correction: Shapely shapes

- `box(minx, miny, maxx, maxy)` is the correct signature; it was previously given incorrectly as `box(<x1>, <x2>, <y1>, <y2>)`.

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import math
import requests
import re
import geopandas as gpd
import os

# new import statements
from sklearn.linear_model import LinearRegression

### CRS

- `<GeoDataFrame object>.crs`: gives you information about current CRS.
- `<GeoDataFrame object>.to_crs(<TARGET CRS>)`: changes CRS to `<TARGET CRS>`.

### Madison area emergency services

- Data source: https://data-cityofmadison.opendata.arcgis.com/
    - Search for:
        - "City limit"
        - "Lakes and rivers"
        - "Fire stations"
        - "Police stations"

- CRS for Madison area: https://en.wikipedia.org/wiki/Universal_Transverse_Mercator_coordinate_system#/media/File:Universal_Transverse_Mercator_zones.svg
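
The `epsg:32616` code used for Madison in the next cell can be read off the UTM zone map linked above. A minimal sketch of that lookup, assuming the standard 6-degree zone layout and the WGS 84 EPSG numbering (326xx for northern-hemisphere zones, 327xx for southern); the helper name `utm_epsg` and the approximate Madison coordinates are illustrative, and the special-case zones around Norway and Svalbard are ignored:

```python
import math

def utm_epsg(lon, lat):
    """Return the WGS 84 / UTM EPSG code for a lon/lat pair (hypothetical helper)."""
    zone = math.floor((lon + 180) / 6) + 1   # zones are 6 degrees wide, counted east from 180°W
    base = 32600 if lat >= 0 else 32700      # 326xx = northern hemisphere, 327xx = southern
    return base + zone

madison_epsg = utm_epsg(-89.39, 43.08)       # approximate coordinates of Madison, WI
print(f"epsg:{madison_epsg}")                # prints epsg:32616
```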

In [None]:
city = gpd.read_file("City_Limit.zip").to_crs("epsg:32616")

In [None]:
gpd.read_file("City_Limit.zip").crs

In [None]:
city.crs

In [None]:
water = gpd.read_file("Lakes_and_Rivers.zip").to_crs(city.crs)
fire = gpd.read_file("Fire_Stations.zip").to_crs(city.crs)
police = gpd.read_file("Police_Stations.zip").to_crs(city.crs)

#### Run this on your virtual machine

`sudo sh -c "echo 'Options = UnsafeLegacyRenegotiation' >> /etc/ssl/openssl.cnf"`

then restart notebook!

#### GeoJSON

How to find the below URL?

- Go to info page of a dataset, for example: https://data-cityofmadison.opendata.arcgis.com/datasets/police-stations/explore?location=43.081769%2C-89.391550%2C12.81
- Then click on "I want to use this" > "View API Resources" > "GeoJSON"

In [None]:
url = "https://maps.cityofmadison.com/arcgis/rest/services/Public/OPEN_DATA/MapServer/2/query?outFields=*&where=1%3D1&f=geojson"
police2 = gpd.read_file(url).to_crs(city.crs)

In [None]:
ax = city.plot(color="lightgray")
water.plot(color="lightblue", ax=ax)
fire.plot(color="red", ax=ax, marker="+", label="Fire")
police2.plot(color="blue", ax=ax, label="Police")
ax.legend(loc="upper left", frameon=False)
ax.set_axis_off()

Saving a `GeoDataFrame` to a `geojson` file.

- `<geodataframe object>.to_file(<relative path for .geojson file>)`

In [None]:
fire.to_file("fire.geojson")

### Geocoding: street address => lat / lon


- `gpd.tools.geocode(<street address>, provider=<geocoding service name>, user_agent=<user agent name>)`: converts street address into lat/long


#### Daily incident reports: https://www.cityofmadison.com/fire/daily-reports

In [None]:
url = "https://www.cityofmadison.com/fire/daily-reports"
r = requests.get(url)
r

In [None]:
r.raise_for_status() # give me an exception if not 200 (e.g., 404)

In [None]:
# doesn't work
pd.read_html(url)

In [None]:
print(r.text)

Find all **span** tags with **streetAddress** using regex.

In [None]:
re.findall(r'<span itemprop="streetAddress">.*?</span>', r.text)

In [None]:
# Slicing the last address out to remove "City of Madison Fire Department's" address
addrs = re.findall(r'<span itemprop="streetAddress">(.*?)</span>', r.text)[:-1]
addrs = pd.Series(addrs)
addrs
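
Both the capture group `( ... )` and the non-greedy `.*?` in the pattern above matter. The sketch below shows why on a made-up HTML fragment; the addresses and the surrounding `<p>` tags are assumptions for illustration, not the real page content:

```python
import re

# Made-up fragment mimicking the structure of the daily-reports page
html = (
    '<p><span itemprop="streetAddress">800 block W. Badger Road</span></p>'
    '<p><span itemprop="streetAddress">1300 East Washington Ave</span></p>'
    '<p><span itemprop="streetAddress">1 Example Plaza</span></p>'  # stands in for the footer address
)

# Non-greedy: each match stops at the first closing </span>
lazy = re.findall(r'<span itemprop="streetAddress">(.*?)</span>', html)

# Greedy: a single match that runs to the last closing </span>
greedy = re.findall(r'<span itemprop="streetAddress">(.*)</span>', html)

incident_addrs = lazy[:-1]   # drop the trailing footer address
print(lazy)
print(len(greedy))
```

With a capture group, `re.findall` returns only the captured text rather than the whole tag.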

#### Without a city and state name, geocoding tends to return the best-known location with that street name.

In [None]:
geo_info = gpd.tools.geocode("1300 East Washington Ave")
geo_info

In [None]:
geo_info["address"].loc[0]

#### To get the address we actually want, we should append "Madison, Wisconsin" to the end of the address.

In [None]:
gpd.tools.geocode("1300 East Washington Ave, Madison, Wisconsin")

#### Addresses with "block" often won't work or won't give you the correct lat/long. We need to remove the word "block" before geocoding.

In [None]:
gpd.tools.geocode("800 block W. Badger Road, Madison, Wisconsin")

In [None]:
gpd.tools.geocode("800 W. Badger Road, Madison, Wisconsin")

In [None]:
addrs

#### Using `str` methods on `pandas` `Series` to do manipulation.

- `<series object>.str.replace(<search str>, <replace str>)`
- always returns a new `Series` object instance - remember strings are immutable

In [None]:
addrs

In [None]:
fixed_addrs = addrs.str.replace(" block ", " ") + ", Madison, Wisconsin"
fixed_addrs
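
A small self-contained check of the same cleanup on two sample addresses, showing that the original `Series` is left untouched. The sample values are illustrative; `regex=False` is passed explicitly as an assumption to make the literal match independent of the pandas version's default:

```python
import pandas as pd

sample_addrs = pd.Series(["800 block W. Badger Road", "1300 East Washington Ave"])

# str.replace and + both build new Series; sample_addrs itself is not modified
sample_fixed = sample_addrs.str.replace(" block ", " ", regex=False) + ", Madison, Wisconsin"

print(sample_fixed.tolist())
print(sample_addrs.tolist())
```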

#### Using a different provider than the default one

- `gpd.tools.geocode(<street address>, provider=<geocoding service name>, user_agent=<user agent name>)`: converts street address into lat/long
    - We will be using OpenStreetMap, for which the `provider` argument is "nominatim".
    - We also need to pass a `user_agent` argument that identifies where the request is coming from; for example: "cs320_bot".
    - Instead of processing a single address, `geocode` can also process a `Series` containing many addresses.

In [None]:
incidents = gpd.tools.geocode(fixed_addrs, provider="nominatim", user_agent="cs320_bot")
incidents

It is often a good idea to drop NA values, although in this version of the example there are no failed geocodings.

In [None]:
incidents = incidents.dropna()
incidents

#### Self-practice

If you want practice with regex, try writing a regular expression and using the match result to make sure that "Madison" and "Wisconsin" are part of each address. Post a question on Piazza if you get stuck.

In [None]:
# self-practice
for addr in incidents["address"]:
    print(addr)
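
One possible approach to the self-practice, sketched on made-up address strings. The format of the samples is an assumption about what a geocoder might return, and the pattern assumes "Madison" appears before "Wisconsin":

```python
import re

# Made-up geocoder-style results; the real strings depend on the provider
sample_addresses = [
    "1300, East Washington Avenue, Madison, Dane County, Wisconsin, 53703, United States",
    "East Washington Avenue, Springfield, Illinois, United States",
]

# \b keeps "Madison" and "Wisconsin" from matching inside longer words
pattern = re.compile(r"\bMadison\b.*\bWisconsin\b")

in_madison = [bool(pattern.search(addr)) for addr in sample_addresses]
print(in_madison)   # prints [True, False]
```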

In [None]:
ax = city.plot(color="lightgray")
water.plot(color="lightblue", ax=ax)
fire.plot(color="red", ax=ax, marker="+", label="Fire")
police2.plot(color="blue", ax=ax, label="Police")
# Adding the incidents on to the map
incidents.to_crs(city.crs).plot(ax=ax, color="k", label="Incidents")
ax.legend(loc="upper left", frameon=False)
ax.set_axis_off()

# ML overview

#### Covid deaths analysis

- Source: https://data.dhsgis.wi.gov/
    - Specifically, let's analyze "COVID-19 Data by Census Tract V2"

In [None]:
# Do not repeatedly download large datasets
# Save a local copy instead
if os.path.exists("covid.geojson"):
    print("Reading local file.")
    df = gpd.read_file("covid.geojson")
else:
    print("Downloading the dataset.")
    # Figure out URL to geojson
    url = ???
    # Read data from geojson URL
    df = gpd.read_file(url)
    # Write geo dataframe into a geojson file
    df.to_file("covid.geojson")

In [None]:
df.head()

In [None]:
df.columns

In [None]:
# Create a geographic plot
df.plot()

### How can we get a clean dataset of COVID deaths in WI?

In [None]:
# Replace -999 with 2; 2 lies in the 0-4 range and is an arbitrary choice instead of using 0
df = df.replace(-999, 2)
# TODO: communicate in final results what percent of values were guessed (imputed)
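
A sketch of how the imputed percentage from the TODO could be computed. It has to be measured before the replacement, while the `-999` markers are still present; the toy values below are made up:

```python
import pandas as pd

toy = pd.DataFrame({"DTH_CUM_CP": [12, -999, 7, -999, 30]})

# The mean of a boolean mask is the fraction of True values
pct_imputed = (toy["DTH_CUM_CP"] == -999).mean() * 100

toy_clean = toy.replace(-999, 2)
print(f"{pct_imputed:.1f}% of values were imputed")
```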

In [None]:
# Create a scatter plot to visualize relationship between "POP" and "DTH_CUM_CP"
df.plot.scatter(x="POP", y="DTH_CUM_CP")

Which points are concerning? Let's take a closer look.

#### Which rows have "DTH_CUM_CP" greater than 300?
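
A boolean mask answers this kind of question. The sketch below uses a made-up three-row table; the `GEOID` values, including the non-numeric one, are hypothetical:

```python
import pandas as pd

toy_tracts = pd.DataFrame({
    "GEOID": ["55025000100", "55025000201", "TRACT N/A"],
    "POP": [4200, 3900, 0],
    "DTH_CUM_CP": [3, 5, 412],
})

# Keep only the rows where the comparison is True
suspicious = toy_tracts[toy_tracts["DTH_CUM_CP"] > 300]
print(suspicious)
```

On the notebook's data, the same mask would be `df[df["DTH_CUM_CP"] > 300]`.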

#### Valid rows have "GEOID" that only contains digits

Using `str` methods to perform filtering: `str.fullmatch` does a full-string match against a regex. Because the whole string must match, the anchor characters (`^`, `$`) are not needed.

In [None]:
df["GEOID"]

In [None]:
df = df[df["GEOID"].str.fullmatch(r"\d+")]
df.plot.scatter(x="POP", y="DTH_CUM_CP")
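
The difference between a full match and a prefix match can be seen with the standard `re` module, whose semantics `Series.str.fullmatch` applies to each element. The identifiers below are made up for illustration:

```python
import re

geoids = ["55025000100", "55025000201", "TRACT N/A", "550250001X"]

# fullmatch: the entire string must consist of digits
full = [g for g in geoids if re.fullmatch(r"\d+", g)]

# match: only the beginning has to match, so "550250001X" slips through
prefix = [g for g in geoids if re.match(r"\d+", g)]

print(full)
print(prefix)
```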

### How can we train/fit models to known data to predict unknowns?
- Feature(s) => Predictions
    - Population => Deaths
    - Cases => Deaths
    - Cases by Age => Deaths
    
- General structure for fitting models:
    ```python
    model = <some model>
    model.fit(X, y)
    y = model.predict(X)
    ```
    where `X` needs to be a matrix or a `DataFrame` and `y` needs to be an array (vector) or a `Series`
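
A minimal sketch of this structure on made-up numbers, using scikit-learn's `LinearRegression` as the model. The toy column name `deaths` and all values are assumptions; the shapes show why `X` is selected with double brackets and `y` with single brackets:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

toy_train = pd.DataFrame({"POP": [1000, 2000, 3000, 4000],
                          "deaths": [10, 20, 30, 40]})

X_toy = toy_train[["POP"]]    # list of columns -> DataFrame, shape (4, 1)
y_toy = toy_train["deaths"]   # single column   -> Series, shape (4,)

toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)

# New inputs must use the same column name(s) as the training features
toy_pred = toy_model.predict(pd.DataFrame({"POP": [5000]}))
print(X_toy.shape, y_toy.shape, round(float(toy_pred[0]), 2))
```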

### Using "POP" as feature.

In [None]:
# We must specify a list of columns to make sure we extract a DataFrame and not a Series
# Feature DataFrame
X = df[["POP"]]

In [None]:
# Label Series
y = df["DTH_CUM_CP"]

### Let's use `LinearRegression` model.

- `from sklearn.linear_model import LinearRegression`

In [None]:
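# Requires: from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)  # predictions for the training rows, kept separate from the labels in y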

Predicting for new values of x.

In [None]:
predict_df = pd.DataFrame({"POP": [1000, 2000, 3000]})
predict_df

In [None]:
model.predict(predict_df)

In [None]:
predict_df["predicted deaths"] = model.predict(predict_df)
predict_df