
# Data Visualization

In [None]:
import pandas as pd
import numpy as np

In [None]:
airbnb = pd.read_csv('https://raw.githubusercontent.com/ishaandey/node/master/week-4/workshop/airbnb.csv') 

As with all new datasets, let's start by familiarizing ourselves with the dataset:

**Try it!** Print the shape, columns, and show a sample observation

In [None]:
airbnb.shape

(7237, 7)

In [None]:
airbnb.sample(5) # sample() pulls one random row by default; sample(5) pulls five

Unnamed: 0,neighbourhood,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
592,Wallingford,185,31,13,0.19,2,269
6683,North Admiral,50,2,10,2.52,1,54
1701,Broadway,160,2,40,0.85,59,232
828,Whittier Heights,112,1,190,3.16,1,327
7198,Pike-Market,70,30,0,,2,205


## Scatter, Bars, and Histograms: The Basics

Some imports: Note that we'll rename `plotly express` as `px`.

`plotly express` is a "wrapper" for the base `plotly` package. What that means is we can use easy, readable functions, and plotly express will do the hard work of converting that input into formats that the underlying library can understand.

Quick aside: If you're a web developer and love JS, or an academic and use R, the same Plotly library is available to use in both languages. 

In [None]:
import plotly
import plotly.express as px

Our overarching goal: **What are the average prices in each neighborhood?**

In order to get a handle on the data, let's use a histogram to show single distributions. How would you know what function to use?  *Google*

In [None]:
px.histogram(x = airbnb.price)

A few outliers are skewing the distribution. Any ideas on how to fix it?

It works, but doesn't really tell us too much. Let's modify the plot by adding some parameters.

With *any* Python package, we can pull up some quick documentation from within Jupyter by appending `?` to a function name
<br>**Try it!** What parameters does `px.histogram` accept?

In [None]:
px.histogram?

In [None]:
px.histogram(x = airbnb.price, range_x=(0,1000))
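Note that `range_x` only zooms the axis; the outliers are still in the data being binned. Another option is to filter the rows before plotting. Here is a minimal sketch of that idea, using a made-up price series in place of `airbnb.price` and an arbitrary 90th-percentile cutoff:

```python
import pandas as pd

# Toy stand-in for airbnb.price; these values are made up for illustration.
prices = pd.Series([50, 70, 85, 90, 110, 120, 150, 160, 185, 5000], name="price")

# Keep listings at or below a chosen quantile (0.9 is an arbitrary choice).
cutoff = prices.quantile(0.9)
trimmed = prices[prices <= cutoff]

print(f"cutoff={cutoff:.1f}, kept {len(trimmed)} of {len(prices)} listings")
```

The trimmed series can then be passed to `px.histogram(x=trimmed)` like before. Whether to zoom or to filter depends on whether the outliers should still count toward the bins.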

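How does this differ across groups? The cell below uses the `room_type` column as the breakout; `neighbourhood_group` can be passed to `color=` the same way. Either column has to exist in `airbnb`, and the `AttributeError` shown below is consistent with `room_type` being missing from the loaded frame (the sample above does not list it).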

In [None]:
df = airbnb.room_type
px.histogram(x = airbnb.price, range_x=(0,1000), color=df, barmode='overlay')

AttributeError: ignored

### Quick aesthetics

Plotly is interactive! Play around with the legends and plot area. <br>Double click on the legend icon on the right, and plotly will automatically update the figure to select those points only.

Let's use some other colors. There are two main types: **discrete sequences** and **continuous scales**. If the data you're interested in has distinct groups (e.g. neighborhoods), use `color_discrete_sequence=`. If it is continuous (e.g. price), use `color_continuous_scale=`. 

How do you know what options are available? Plotly has several default options for each type.
You can check them out using `px.colors.qualitative.swatches()` for discrete options or `px.colors.sequential.swatches()` for continuous scales.

As a reminder, the `color=` parameter only splits the graph into differently colored groups, based upon the *attribute* (column) given. To change the colors themselves, we need to pass a set of colors to another parameter.  

We can change our colors fairly easily using color scales.

<br>If the feature we pass to `color=` is **discrete or categorical**, we'll add the `color_discrete_sequence` param
* Documentation for the color schemes accepted: https://plotly.com/python/discrete-color/

<br>If the feature is instead **continuous**, we'll use the `color_continuous_scale` param instead
* The corresponding docs for continuous schemes: https://plotly.com/python/colorscales/
* And the available color scales:  https://plotly.com/python/builtin-colorscales/

<br> Open the docs, and try out your favorite below:

In [None]:
df = airbnb.room_type
px.histogram(x = airbnb.price, range_x=(0,1000), color=df, barmode='overlay', color_discrete_sequence= ["green", "blue", "goldenrod", "magenta"] )

AttributeError: ignored

Finally, we can add labels to our charts as dictionaries, in the form of `{'column_name':'Column Name', 'another_name':'Another Name'}`

In [None]:
df = airbnb.room_type
px.histogram(x = airbnb.price, range_x=(0,1000), color=df, barmode='overlay', color_discrete_sequence= ["green", "blue", "goldenrod", "magenta"], labels= {'color':'Room Type'})

AttributeError: ignored

Say we want to adjust this plot to show *relative* values. That is, we want to better highlight the price distribution of hotel rooms, even though they occur far less often than entire homes/apts.

In [None]:
px.histogram(x = airbnb.price, range_x=(0,1000), color=df, barmode='overlay', color_discrete_sequence= ["green", "blue", "goldenrod", "magenta"], labels= {'color':'Room Type'}, histnorm='probability')
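To see what "relative" means here, it can help to work the numbers by hand. The sketch below uses made-up prices for two groups of very different sizes and converts each group's bin counts into shares of that group's total, which is the idea behind `histnorm='probability'`:

```python
import numpy as np

# Made-up prices for two room types of very different sizes.
groups = {
    "entire_home": np.array([80, 120, 150, 150, 200, 220, 250, 300]),
    "hotel_room": np.array([150, 260]),
}

bins = [0, 100, 200, 300, 400]  # bin edges, chosen arbitrarily

shares = {}
for name, values in groups.items():
    counts, _ = np.histogram(values, bins=bins)
    # Divide by the group's own total so each group's shares sum to 1.
    shares[name] = counts / counts.sum()
    print(name, counts, shares[name])
```

With raw counts the two hotel rooms would barely register next to eight entire homes; as shares, both groups are on the same 0 to 1 scale.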

In [None]:
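px.colors.qualitative.swatches().show()  # discrete palettes
px.colors.sequential.swatches().show()   # continuous scales; .show() is needed because a cell only displays its last expression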

### Using GroupBys for Aggregation

Oftentimes we'll want to create visualizations at some aggregate level. 

For example, let's say we want to show neighborhoods with a high median rental price. 
<br>Our data is at a *per-listing* level, meaning that each individual row is its own listing, with its own price. 
<br>To get data at the *per-neighborhood* level, we've got to *roll up* all the listing prices per neighborhood. In other words, we group the data by neighborhood, then find the median across all of those listings.

Before aggregating, we could drop the columns that make no sense to take a median of. The cell below is left commented out, since the grouping step still needs `neighbourhood_group`.

In [None]:
# airbnb= airbnb.drop(columns=['name', 'host_id', 'neighbourhood_group', 'room_type', 'latitude', 'longitude'])

In [None]:
airbnb_by_neighbourhood = airbnb.groupby(by=['neighbourhood_group', 'neighbourhood']).median(numeric_only=True).reset_index()  # numeric_only skips text columns; reset_index() turns the group keys back into regular columns
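To make the roll-up concrete, here is a small sketch on a toy table (the rows are invented, and only the column names mirror the Airbnb data). Each neighbourhood's listings collapse into a single row holding the median of every numeric column:

```python
import pandas as pd

# Toy per-listing data; values are made up for illustration.
listings = pd.DataFrame({
    "neighbourhood": ["Broadway", "Broadway", "Broadway", "Wallingford", "Wallingford"],
    "price": [160, 100, 130, 185, 95],
    "minimum_nights": [2, 1, 30, 31, 2],
})

by_neighbourhood = (
    listings.groupby("neighbourhood")
    .median(numeric_only=True)            # one median per numeric column per group
    .reset_index()                        # bring 'neighbourhood' back as a column
    .sort_values("price", ascending=False)
)
print(by_neighbourhood)
```

Five listing rows become two neighbourhood rows, sorted so the most expensive neighbourhood comes first.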

In breakout groups, see if you can **build a bar plot** to show median prices in each neighbourhood group, and sort them in a meaningful way

Make it complete! Label axes, hover text, color, the whole nine yards.

In [None]:
bar_plot = px.bar(airbnb_by_neighbourhood.sort_values('price'), x='neighbourhood', y='price', color='neighbourhood_group')  # sort by median price before plotting
bar_plot.show()

Say my friend and I have a budget of $90 per night. Show which regions are ideal for this. How you do that is entirely up to you: draw a **horizontal line**, or color the bars for the ideal regions differently, as long as it communicates which neighborhoods are generally cheaper.


**Hint**: To draw a line, use `fig.add_hline()` with corresponding parameters
<br>**Hint**: To color bars according to some condition, first **create a new column** that describes if the value is below budget.

In [None]:
# bar_plot.add_hline(y=90)  # add_hline() requires plotly >= 4.12; left commented out because the Colab runtime used here shipped an older version

## Time Trends

Tabular data comes in two formats: *wide* or *long*.

Wide form puts the core observational unit in its own row, with one column per measurement, while long-form data shows each possible data combination *as its own row*. In practice, wide data is more human-readable, while long-form data tends to lend itself better to visualization tasks. 

Pandas gives us a set of functions to switch back and forth between the two formats, as needed.
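As a small illustration of the round trip, the sketch below builds a two-region wide table (the Wallingford value for 2010-10 is made up), melts it to long form, and pivots it back. The column names `Date` and `ZRI` are chosen for this example only:

```python
import pandas as pd

# Wide: one row per region, one column per month.
wide = pd.DataFrame({
    "RegionName": ["Capitol Hill", "Wallingford"],
    "2010-09": [1372.0, 1562.0],
    "2010-10": [1401.0, 1570.0],  # second value is invented for illustration
})

# Wide -> long: one row per region-month combination.
long = wide.melt(id_vars="RegionName", var_name="Date", value_name="ZRI")

# Long -> wide again: months go back to being columns.
back_to_wide = long.pivot(index="RegionName", columns="Date", values="ZRI").reset_index()

print(long)
print(back_to_wide)
```

Two regions times two months gives four long-form rows, and pivoting recovers the original two-row layout.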

Let's use the Zillow dataset to explore this further. Our overarching goal is to **show price trends in Seattle neighborhoods**.

In [None]:
zillow = pd.read_csv('https://raw.githubusercontent.com/ishaandey/node/master/week-4/workshop/zillow.csv') 

In [None]:
zillow.head() #really wide data, can't really type in a list of columns because there are so many

Unnamed: 0,RegionID,RegionName,City,State,Metro,CountyName,SizeRank,2010-09,2010-10,2010-11,2010-12,2011-01,2011-02,2011-03,2011-04,2011-05,2011-06,2011-07,2011-08,2011-09,2011-10,2011-11,2011-12,2012-01,2012-02,2012-03,2012-04,2012-05,2012-06,2012-07,2012-08,2012-09,2012-10,2012-11,2012-12,2013-01,2013-02,2013-03,2013-04,2013-05,...,2016-10,2016-11,2016-12,2017-01,2017-02,2017-03,2017-04,2017-05,2017-06,2017-07,2017-08,2017-09,2017-10,2017-11,2017-12,2018-01,2018-02,2018-03,2018-04,2018-05,2018-06,2018-07,2018-08,2018-09,2018-10,2018-11,2018-12,2019-01,2019-02,2019-03,2019-04,2019-05,2019-06,2019-07,2019-08,2019-09,2019-10,2019-11,2019-12,2020-01
0,271985,South End,Tacoma,WA,Seattle-Tacoma-Bellevue,Pierce County,262,1153.0,1168.0,1188.0,1180.0,,,1079.0,1059.0,1047.0,1034.0,1015.0,1011.0,1011.0,1015.0,1027.0,1030.0,1033.0,1037.0,1037.0,1030.0,1028.0,1029.0,1031.0,1033.0,1038.0,1044.0,1036.0,1035.0,1027.0,1028.0,1029.0,1032.0,1036.0,...,1296.0,1301.0,1306.0,1300.0,1296.0,1295.0,1304.0,1320.0,1335.0,1354.0,1368.0,1379.0,1383.0,1380.0,1375.0,1372.0,1374.0,1374.0,1376.0,1381.0,1386.0,1393.0,1403.0,1414.0,1421.0,1424.0,1430.0,1438.0,1433.0,1424.0,1420.0,1422.0,1427.0,1435.0,1445.0,1454.0,1465.0,1480.0,1490.0,1517.0
1,250206,Capitol Hill,Seattle,WA,Seattle-Tacoma-Bellevue,King County,320,1372.0,1401.0,1429.0,1446.0,1455.0,1458.0,1461.0,1466.0,1476.0,1485.0,1501.0,1511.0,1516.0,1520.0,1525.0,1531.0,1540.0,1554.0,1569.0,1583.0,1599.0,1607.0,1625.0,1634.0,1643.0,1651.0,1655.0,1664.0,1658.0,1647.0,1652.0,1657.0,1662.0,...,2189.0,2192.0,2189.0,2178.0,2166.0,2162.0,2159.0,2170.0,2192.0,2214.0,2235.0,2240.0,2224.0,2199.0,2181.0,2153.0,2128.0,2114.0,2100.0,2102.0,2127.0,2153.0,2169.0,2188.0,2184.0,2171.0,2167.0,2157.0,2144.0,2138.0,2138.0,2146.0,2162.0,2181.0,2199.0,2210.0,2195.0,2183.0,2192.0,2169.0
2,273587,Eastside-ENACT,Tacoma,WA,Seattle-Tacoma-Bellevue,Pierce County,356,1218.0,1242.0,1263.0,1252.0,,,1128.0,1103.0,1085.0,1070.0,1053.0,1047.0,1045.0,1047.0,1055.0,1060.0,1059.0,1055.0,1049.0,1046.0,1046.0,1056.0,1067.0,1073.0,1078.0,1079.0,1071.0,1071.0,1067.0,1064.0,1063.0,1069.0,1078.0,...,1317.0,1323.0,1331.0,1331.0,1329.0,1332.0,1347.0,1369.0,1384.0,1397.0,1408.0,1418.0,1419.0,1414.0,1404.0,1400.0,1402.0,1404.0,1406.0,1411.0,1415.0,1424.0,1438.0,1450.0,1456.0,1460.0,1464.0,1461.0,1456.0,1451.0,1451.0,1458.0,1468.0,1478.0,1486.0,1492.0,1496.0,1504.0,1519.0,1547.0
3,344035,Nevada-Lidgerwood,Spokane,WA,Spokane-Spokane Valley,Spokane County,419,919.0,939.0,950.0,958.0,932.0,,,,,754.0,749.0,755.0,766.0,775.0,779.0,775.0,770.0,770.0,773.0,775.0,775.0,777.0,780.0,782.0,784.0,781.0,782.0,782.0,780.0,783.0,783.0,786.0,790.0,...,952.0,949.0,946.0,942.0,938.0,937.0,939.0,944.0,956.0,972.0,972.0,974.0,969.0,973.0,972.0,969.0,973.0,980.0,991.0,999.0,1006.0,1008.0,1009.0,1010.0,1013.0,1021.0,1028.0,1029.0,1033.0,1035.0,1040.0,1047.0,1053.0,1057.0,1062.0,1067.0,1087.0,1096.0,1101.0,1100.0
4,272001,University District,Seattle,WA,Seattle-Tacoma-Bellevue,King County,449,1313.0,1331.0,1354.0,1368.0,1367.0,1357.0,1343.0,1337.0,1334.0,1335.0,1343.0,1359.0,1380.0,1397.0,1408.0,1412.0,1413.0,1418.0,1414.0,1417.0,1426.0,1437.0,1448.0,1460.0,1470.0,1485.0,1496.0,1504.0,1504.0,1506.0,1520.0,1536.0,1547.0,...,2036.0,2039.0,2032.0,2011.0,1999.0,1997.0,2010.0,2023.0,2039.0,2050.0,2071.0,2071.0,2061.0,2036.0,2012.0,1991.0,1987.0,1980.0,1984.0,1996.0,2032.0,2057.0,2070.0,2068.0,2065.0,2054.0,2035.0,2049.0,2049.0,2046.0,2051.0,2060.0,2081.0,2106.0,2126.0,2141.0,2087.0,2048.0,2067.0,2056.0


How many metro areas are covered in the Zillow dataset? 

In [None]:
zillow.Metro.nunique()

8

Let's subset our data to just Seattle

In [None]:
zillow = zillow[zillow.City == 'Seattle']
zillow.shape

(76, 120)

### Moving from wide to long form

In [None]:
zillow.columns.values[1:10]

array(['RegionName', 'City', 'State', 'Metro', 'CountyName', 'SizeRank',
       '2010-09', '2010-10', '2010-11'], dtype=object)

A quick bit of cleaning is necessary here. Right now, each row in zillow represents a unique region, and the Zillow Rental Index (ZRI) value for each month is given in its own column (113 months = 113 columns). For visualization, we'd like each **region - month combination to be its own row**.

Look at the Pandas cheatsheet below to see what the relevant operation should be.

![Pandas Cheatsheet](https://github.com/ishaandey/node/blob/master/week-4/workshop/pd-reshape.png?raw=1)

Let's use `pd.melt()` to move from wide to long, making sure to google the documentation along the way. 

To make this more tangible, our goal is to convert our wide data in the form of: 
<br>`RegionName | 2010-09 | 2010-10 | 2010-11 ... ` to the long form
<br>`RegionName | Date | ZRI`.

**Hint**: RegionID through SizeRank are all *ID variables*. They identify each observation rather than measure it, so they should be kept as columns (not dropped or pivoted) in the transformation. 

In [None]:
z = pd.melt(zillow, id_vars= ['RegionID', 'RegionName', 'City', 'State', 'Metro', 'CountyName',
       'SizeRank'], var_name='Date', value_name='z')
z

Unnamed: 0,RegionID,RegionName,City,State,Metro,CountyName,SizeRank,Date,z
0,250206,Capitol Hill,Seattle,WA,Seattle-Tacoma-Bellevue,King County,320,2010-09,1372.0
1,272001,University District,Seattle,WA,Seattle-Tacoma-Bellevue,King County,449,2010-09,1313.0
2,271990,Magnolia,Seattle,WA,Seattle-Tacoma-Bellevue,King County,725,2010-09,1632.0
3,250788,Greenwood,Seattle,WA,Seattle-Tacoma-Bellevue,King County,746,2010-09,1223.0
4,252248,Wallingford,Seattle,WA,Seattle-Tacoma-Bellevue,King County,752,2010-09,1562.0
...,...,...,...,...,...,...,...,...,...
8583,343995,Denny Triangle,Seattle,WA,Seattle-Tacoma-Bellevue,King County,4153,2020-01,2260.0
8584,344034,Riverview,Seattle,WA,Seattle-Tacoma-Bellevue,King County,4171,2020-01,2039.0
8585,251076,Lakewood,Seattle,WA,Seattle-Tacoma-Bellevue,King County,4640,2020-01,1968.0
8586,251186,Madison Park,Seattle,WA,Seattle-Tacoma-Bellevue,King County,5052,2020-01,2690.0


In [None]:
zillow.columns

Index(['RegionID', 'RegionName', 'City', 'State', 'Metro', 'CountyName',
       'SizeRank', '2010-09', '2010-10', '2010-11',
       ...
       '2019-04', '2019-05', '2019-06', '2019-07', '2019-08', '2019-09',
       '2019-10', '2019-11', '2019-12', '2020-01'],
      dtype='object', length=120)

In [None]:
px.line(z[z.RegionName.isin(['Denny Triangle', 'First Hill', 'Capitol Hill', 'Belltown', 'Uptown'])], x='Date', y='z', color='RegionName')

Let's take a look at a particular region, say Capitol Hill. 

#### Interpolating Data

Because some data points are missing, we have to interpolate them, that is, make a best guess based on the values before and after each gap.

In [None]:
date_idx = pd.date_range(start='2011-01-01', end='2020-01-01', freq='MS')

def fill_timeseries(df):
    df2 = df.set_index(pd.to_datetime(df['Date'])).drop(columns='Date')
    df3 = df2.reindex(date_idx, fill_value=np.nan)
    return df3.reset_index()

In [None]:
# `z` is the long-form frame from pd.melt() above; its value column is also named 'z'
ZRI2 = z.groupby(by=['RegionID']).apply(fill_timeseries).reset_index(drop=True).rename(columns={'index':'Date'})
ZRI2['z'] = ZRI2['z'].interpolate(method='linear')
ZRI2
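One thing to watch with linear interpolation on long-form data is that a plain `.interpolate()` runs straight down the column, so a gap at the start of one region can be filled using the previous region's values. A sketch of a per-region alternative, on made-up data with our own column names:

```python
import numpy as np
import pandas as pd

# Made-up long-form data with one gap per region.
long = pd.DataFrame({
    "RegionName": ["A", "A", "A", "B", "B", "B"],
    "Date": pd.to_datetime(["2011-01-01", "2011-02-01", "2011-03-01"] * 2),
    "ZRI": [1000.0, np.nan, 1200.0, np.nan, 2000.0, 2100.0],
})

# Interpolate within each region so values never leak across regions.
long["ZRI_filled"] = long.groupby("RegionName")["ZRI"].transform(
    lambda s: s.interpolate(method="linear")
)
print(long)
```

Region A's gap is filled with the midpoint of its neighbours. Region B's leading gap stays empty, because there is no earlier value in that region to interpolate from.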

### Making a Time Series Plot

Now that we've got data in a long format, we can easily use plotly express functions to create a time series. Say I'm interested in the following neighborhoods: `'Denny Triangle', 'First Hill', 'Capitol Hill', 'Belltown', 'Uptown'`. Plot the time trends, and make it pretty. 

## Advanced Topics: Geographic Plots

There are quite a few different ways to show geographical data, usually with choropleth charts or scatter plots.
<br>Our friend Plotly has them all: https://plotly.com/python/maps/

A quick note about how this works before letting you leaf through the docs page.

<br> Most of the params in `px.scatter_mapbox()` behave pretty similarly to `px.scatter`, except that we provide **latitude and longitude data** instead of `x` and `y`. Luckily, our dataset already has that included, but oftentimes we'll have to find a lookup table online to convert city names, for example, to lat / lon coordinates. 

<br>We don't necessarily have to provide a value to `size=`, but that usually can help highlight points of interest.
<br> `zoom=` on the other hand, just changes how zoomed in the initial picture is when first loaded.

<br>Finally, we'll have to update the `mapbox_style=` parameter of the figure to a specific base map to load. 
<br>For more information on what options are available here, check out https://plotly.com/python/mapbox-layers/

Show places in Downtown, Central Area, and Capitol Hill, and highlight those under budget.

### Geocoding Addresses

In [None]:
# More to come