# UEP-0239: Python for Data Analysis and Visualization

---

## Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

---

## Quick Overview of Matplotlib

Matplotlib works in a layered fashion. First you define your plot using `plt.plot(x, y, ...)`, then you can use additional `plt` methods to add more layers to your plot or modify its appearance. Finally, you use `plt.show()` to show the plot or `plt.savefig()` to save it to an external file. Let's see how Matplotlib works in practice by creating some trigonometric plots.

In [None]:
x = np.linspace(0, 2 * np.pi, num = 20)
y = np.sin(x)

In [None]:
plt.plot(x, y)
plt.show()
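
The overview above also mentions `plt.savefig()`. The following is a minimal sketch of saving the same sine plot to a file instead of displaying it. The file name `sine.png` and the temporary directory are assumptions chosen for illustration; any writable path works.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, num=20)
y = np.sin(x)

# hypothetical output location, chosen only for this sketch
out_path = os.path.join(tempfile.gettempdir(), "sine.png")

plt.plot(x, y)
plt.savefig(out_path, dpi=150)   # the file format is inferred from the extension
plt.close()                      # release the figure once it has been written

saved_ok = os.path.exists(out_path)
print(saved_ok)
```

If you want to both save and display a plot, call `plt.savefig()` before `plt.show()`, because the figure may be cleared once it has been shown.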

`plt.plot()` takes additional arguments that modify the appearance of the plot. See the documentation for details: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html

In [None]:
# we can specify the style of the plot using named arguments
plt.plot(x, y, color = 'red', linestyle = '--', marker = 'o')
plt.show()

In [None]:
# or we could use a shorthand string
plt.plot(x, y, 'r--o')
plt.show()

We can easily add additional layers and stylistic elements to the plot.

In [None]:
plt.plot(x, y, 'r--o')
plt.plot(x, np.cos(x), 'b-*')
plt.title('Sin and Cos')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(['sin', 'cos'])
plt.show()

Note that if we only supply one array as an input to `plt.plot()`, it uses the values of the array as `y` values and uses the indices of the array as `x` values.

In [None]:
plt.plot([2, 3, 6, 4, 8, 9, 5, 7, 1])
plt.show()

If we want to create a figure with several subplots, we can use `plt.subplots()` to create a grid of subplots. It takes the dimensions of the subplot grid as input, *`plt.subplots(rows, columns)`*, and returns two objects. The first is a figure object and the second is a NumPy array containing the subplots. In Matplotlib, subplots are often called *axes*.

In [None]:
# create a more fine-grained array to work with
a = np.linspace(0, 2 * np.pi, num = 100)

In [None]:
# create a two-by-two grid for our subplots
fig, ax = plt.subplots(2, 2)

# create subplots
ax[0, 0].plot(a, np.sin(a))     # upper-left
ax[0, 1].plot(a, np.cos(a))     # upper-right
ax[1, 0].plot(a, np.tan(a))     # bottom-left
ax[1, 1].plot(a, -a)            # bottom-right

# show figure
plt.show()
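
Because the second object returned by `plt.subplots()` is a NumPy array, we can inspect its shape and loop over it like any other array. The sketch below redraws the same four curves in a loop; the titles and the call to `fig.tight_layout()` are additions for illustration, not part of the cell above.

```python
import matplotlib.pyplot as plt
import numpy as np

a = np.linspace(0, 2 * np.pi, num=100)
fig, ax = plt.subplots(2, 2)

grid_shape = ax.shape   # the array has one entry per subplot
print(grid_shape)

curves = [np.sin(a), np.cos(a), np.tan(a), -a]
titles = ['sin', 'cos', 'tan', '-a']

# ax.flat walks through the grid row by row
for axes, curve, title in zip(ax.flat, curves, titles):
    axes.plot(a, curve)
    axes.set_title(title)

fig.tight_layout()      # keep the titles from overlapping neighbouring subplots
n_axes = len(fig.axes)
plt.close(fig)          # in a notebook, use plt.show() here instead
```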

A more MATLAB-esque way of creating subplots would be to use the alternative `plt.subplot()` method. Using this method, you define each subplot with a three-number combination, `plt.subplot(rows, columns, index)`. The indexes of the subplots defined using this method increase in ***row-major*** order and, in true MATLAB fashion, begin with one.

In [None]:
plt.subplot(2, 2, 1)    # upper-left
plt.plot(a, np.sin(a))
plt.subplot(2, 2, 2)    # upper-right
plt.plot(a, np.cos(a))
plt.subplot(2, 2, 3)    # bottom-left
plt.plot(a, np.tan(a))
plt.subplot(2, 2, 4)    # bottom-right
plt.plot(a, -a)
plt.show()
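
To make the row-major numbering concrete, here is a small sketch that maps each one-based `plt.subplot()` index to the zero-based `[row, column]` position used with `plt.subplots()`. The use of `np.unravel_index` is our own choice for the illustration.

```python
import numpy as np

rows, columns = 2, 2
mapping = {}

for index in range(1, rows * columns + 1):
    # subplot indexes start at one, array positions start at zero
    row, col = np.unravel_index(index - 1, (rows, columns))
    mapping[index] = (int(row), int(col))
    print(f"plt.subplot({rows}, {columns}, {index}) corresponds to ax[{row}, {col}]")
```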

---
## Working with Messy Data

In [None]:
grades = pd.read_csv('data/grades.csv')

In [None]:
grades

In [None]:
print(grades)

### Cleaning Column Names

In [None]:
grades.rename(str.lower, axis = 'columns')

In [None]:
grades

In [None]:
grades = grades.rename(str.lower, axis = 'columns')

In [None]:
grades

In [None]:
grades.rename(columns = {'exam 1': 'exam1', 'exam_3': 'exam3'}, inplace = True)

In [None]:
grades

### Indexing and Datatypes

In [None]:
grades.dtypes

In [None]:
grades['name']

In [None]:
grades.name

In [None]:
grades[['name']]

In [None]:
grades['name'][0]

In [None]:
grades['name'][1]

In [None]:
grades.name[1]

In [None]:
grades['exam1'][1]

In [None]:
grades.exam2[1]

In [None]:
grades['exam3'][1]

In [None]:
grades.exam4[1]

In [None]:
grades

In [None]:
print(type(grades['exam3'][0]))
print(type(grades['exam3'][1]))
print(type(grades['exam3'][2]))

### Assigning Values and Working with Missing Data

In [None]:
grades['exam3'][2] = 0

**Oh no, a really scary warning!** What is happening?

Because Python uses something called *pass-by-object-reference* and pandas does a lot of optimization in the background, the end user (that is, you) has little to no control over whether they are referencing the **original** object or a **copy**. This **warning** is just pandas letting us know that when using *chained indexing* to write a value, the behaviour is ***undefined***, meaning that **pandas** cannot be sure whether you are writing to the **original** data frame or a temporary **copy**.

To learn more: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

In [None]:
grades

Phew, this time we got lucky. However, with a different data frame the same approach might actually write the changes to a *temporary copy* and leave the original data frame unchanged. Chained indexing is dangerous and you should avoid using it to **write** values. What should we use instead?

There are a **lot** of options: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

- To write a single value, use `.at[row, column]`
- To write a range of values (or a single value), use `.loc[row(s), column(s)]`

*Note that `.at` and `.loc` use row and column labels. The numbers 0, 1, 2, 3, and 4 that we see in front of the rows are actually row labels. By default, row labels match row indexes in pandas. However, quite often you will work with rows that have actual labels. Sometimes those labels might be numeric and resemble indexes, which leads to confusion and error. Hence, if you want to make sure you are using _indexes_, not labels, use `.iat` and `.iloc` instead.*
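
The difference between labels and indexes is easiest to see on a data frame whose row labels are not in order. The sketch below uses a small made-up data frame (the names and scores are hypothetical) to show label-based and position-based access side by side, and to write values in a single indexing step.

```python
import pandas as pd

# hypothetical data; the row labels are deliberately shuffled
scores = pd.DataFrame(
    {'name': ['Ann', 'Bob', 'Cyd'], 'exam1': [90, 80, 70]},
    index=[2, 0, 1],
)

by_label = scores.loc[0, 'name']     # the row *labeled* 0
by_position = scores.iloc[0, 0]      # the *first* row, first column
print(by_label, by_position)

# one indexing step each, so the values are written to scores itself
scores.loc[0, 'exam1'] = 85
scores.at[1, 'exam1'] = 75
print(scores)
```

Here `.loc[0, 'name']` and `.iloc[0, 0]` return different students, which is exactly the kind of confusion the note above warns about.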

In [None]:
grades.at[2, 'exam3']

In [None]:
grades.loc[3, 'exam2'] = np.nan

In [None]:
grades

In [None]:
grades.dtypes

In [None]:
grades['exam2'][0]

In [None]:
grades.exam3[0]

In [None]:
grades['exam2'] = pd.to_numeric(grades['exam2'])
grades['exam3'] = pd.to_numeric(grades.exam3)

In [None]:
grades

In [None]:
grades.dtypes

### Aggregating Data

In [None]:
grades['sum'] = grades['exam1'] + grades['exam2'] + grades['exam3'] + grades['exam4']

In [None]:
grades

In [None]:
grades.drop('sum', axis = 'columns', inplace = True)

In [None]:
grades

In [None]:
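# numeric_only skips the text column, which recent pandas versions refuse to add to numbers
grades['sum'] = grades.sum(axis = 'columns', numeric_only = True)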

In [None]:
grades

---
## A Better Way of Working with Messy Data

In [None]:
del grades

In [None]:
grades = pd.read_csv('data/grades.csv', na_values = 'excused')
grades

In [None]:
grades.rename(str.lower, axis = 'columns', inplace = True)
grades.rename(columns = {'exam 1': 'exam1', 'exam_3': 'exam3'}, inplace = True)
grades

In [None]:
grades.dtypes

In [None]:
grades['exam3'] = pd.to_numeric(grades['exam3'], errors = 'coerce')
grades

In [None]:
grades.dtypes

In [None]:
grades['exam3'] = grades['exam3'].fillna(0)
grades
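
The two cells above rely on `errors = 'coerce'` and `.fillna()`. As a rough illustration of what each step does, here is a sketch on a made-up series of exam scores; the values are hypothetical and are not taken from `grades.csv`.

```python
import pandas as pd

# hypothetical raw column in which one entry is text rather than a number
raw = pd.Series(['95', '88', 'excused', '72'])

# entries that cannot be parsed become NaN instead of raising an error
numeric = pd.to_numeric(raw, errors='coerce')
print(numeric.dtype)

# replace the missing values with zero
filled = numeric.fillna(0)
print(filled.tolist())
```

Whether a missing score should count as zero is a grading decision rather than a technical one, so treat `fillna(0)` as one option among several.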

In [None]:
# numeric_only skips the text column when averaging across each row
grades['mean'] = grades.mean(axis = 'columns', numeric_only = True)
grades

In [None]:
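# numeric_only keeps the text column out of the calculations
grades.loc[:, 'max'] = grades.max(axis = 'columns', numeric_only = True)
grades.loc['mean'] = grades.mean(axis = 'rows', numeric_only = True)
grades.loc['max', :] = grades.max(axis = 'rows', numeric_only = True)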
grades

---
## Working with Real Data

In [None]:
avocados = pd.read_csv('data/avocado.csv')

In [None]:
avocados

In [None]:
avocados.head()

In [None]:
avocados.shape

In [None]:
avocados.dtypes

### Subsetting Data using Boolean Indexing

In [None]:
avocados.geography

In [None]:
avocados.geography == 'Boston'

In [None]:
avocados[avocados.geography == 'Boston']

In [None]:
avocados_boston = avocados[avocados.geography == 'Boston']

In [None]:
avocados_boston.head(10)

In [None]:
avocados_boston_copy = avocados[avocados.geography == 'Boston'].copy()

In [None]:
avocados_boston_copy.head(10)

In [None]:
np.mean(avocados_boston.average_price[avocados_boston.year == 2019])

In [None]:
mean_2019 = np.mean(avocados_boston.average_price[avocados_boston.year == 2019])

In [None]:
print("The average price for avocados in the Boston area in the year 2019 was: $", round(mean_2019, 2))
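
The cells above filter the data in two steps, first by geography and then by year. Here is a sketch of the same idea on a tiny made-up data frame (all values are hypothetical), including a way to combine both conditions into a single mask with `&`.

```python
import pandas as pd

# hypothetical data with the same column names as the avocado data set
produce = pd.DataFrame({
    'geography': ['Boston', 'Albany', 'Boston', 'Albany'],
    'year': [2018, 2019, 2019, 2019],
    'average_price': [1.50, 1.20, 1.70, 1.30],
})

# a comparison yields one True/False value per row
mask = produce['geography'] == 'Boston'
print(mask.tolist())

# indexing with the mask keeps the True rows; .copy() makes the subset independent
boston = produce[mask].copy()

# each condition needs its own parentheses when combined with &
boston_2019 = produce[(produce['geography'] == 'Boston') & (produce['year'] == 2019)]
mean_price = boston_2019['average_price'].mean()
print(round(mean_price, 2))
```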

### Creating Plots

In [None]:
plt.plot(avocados_boston.date, avocados_boston.average_price)
plt.show()

In [None]:
plt.figure(figsize = (20, 8))
plt.plot(avocados_boston.date, avocados_boston.average_price, color = 'green', linestyle = '--', marker = 'o')
plt.xlabel("Date")
plt.ylabel("Avocado Price [$]")
plt.title("Avocado Prices in Boston")
plt.show()

In [None]:
avocados_boston.plot(x = 'date', y = 'average_price', figsize = (18, 8), kind='line', color = 'green')
plt.xlabel("Date")
plt.ylabel("Avocado Price [$]")
plt.title("Avocado Prices in Boston")
plt.show()

In [None]:
avocados_boston[avocados_boston.year == 2019].plot(x = 'date', y = 'average_price', figsize = (18, 8), kind='line', color = 'green')
plt.xlabel("Date")
plt.ylabel("Avocado Price [$]")
plt.title("Avocado Prices in Boston")
plt.show()

In [None]:
plt.hist(avocados.average_price)
plt.xlabel('Price')
plt.show()

In [None]:
sns.histplot(avocados.average_price, color = 'r', kde = True)

In [None]:
sns.histplot(avocados.average_price[avocados.year == 2019], color = 'r', kde = True)

In [None]:
sns.histplot(avocados.average_price[avocados.geography == 'Boston'], color = 'r', kde = True)

---

## Combining Datasets and Long vs Wide Data

In [None]:
pop = pd.read_csv('data/population.csv')

In [None]:
pop

In [None]:
pop.drop(labels = ['Indicator Name', 'Indicator Code'], axis = 1, inplace = True)

In [None]:
pop

In [None]:
pop_long = pop.melt(id_vars = ['Country Name', 'Country Code'], var_name = 'Year', value_name = 'Population')

In [None]:
pop_long

In [None]:
gdp = pd.read_csv('data/gdp.csv').drop(labels = ['Indicator Name', 'Indicator Code'], axis = 1)

In [None]:
gdp

In [None]:
gdp_long = gdp.melt(id_vars = ['Country Name', 'Country Code'], var_name = 'Year', value_name = 'GDP')

In [None]:
gdp_long

In [None]:
countries = pop_long.merge(gdp_long.drop(labels = ['Country Name'], axis = 1), 
                           on = ['Country Code', 'Year'], how = 'inner')

In [None]:
countries

In [None]:
countries['GDP per capita'] = countries['GDP'] / countries['Population']

In [None]:
countries
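
The reshaping and merging steps above are easier to follow on a data set small enough to read in full. The sketch below repeats them on two made-up wide tables; the country codes and numbers are hypothetical.

```python
import pandas as pd

# hypothetical wide tables: one row per country, one column per year
pop_wide = pd.DataFrame({'Country Code': ['AAA', 'BBB'],
                         '2000': [10, 20], '2001': [11, 22]})
gdp_wide = pd.DataFrame({'Country Code': ['AAA', 'BBB'],
                         '2000': [100.0, 400.0], '2001': [121.0, 484.0]})

# wide to long: every year column becomes its own row
pop_long = pop_wide.melt(id_vars=['Country Code'], var_name='Year', value_name='Population')
gdp_long = gdp_wide.melt(id_vars=['Country Code'], var_name='Year', value_name='GDP')
print(pop_long)

# an inner merge keeps only the country/year pairs present in both tables
combined = pop_long.merge(gdp_long, on=['Country Code', 'Year'], how='inner')
combined['GDP per capita'] = combined['GDP'] / combined['Population']
print(combined)
```

Two countries and two years give four rows in the long format, one for each country and year pair.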

In [None]:
# replace = False makes sure we get ten different countries
random_countries = np.random.choice(countries['Country Code'].unique(), 10, replace = False)

In [None]:
countries_select = countries[countries['Country Code'].isin(random_countries)]

In [None]:
countries_select

In [None]:
for name, data in countries_select.groupby('Country Name'):
    data.plot(x = 'Year', y = 'GDP per capita', label = name, figsize = (18, 8), ax = plt.gca())
plt.show()

---
## Grouping, Resampling, and Working with Timeseries

In [None]:
mbta = pd.read_csv('data/mbta.csv')

In [None]:
mbta

In [None]:
mbta.dtypes

In [None]:
mbta['time_period'] = mbta['time_period'].str.strip('()')

In [None]:
mbta

In [None]:
mbta['datestring'] = mbta['service_date'] + ' ' + mbta['time_period']

In [None]:
mbta

In [None]:
mbta.dtypes

In [None]:
mbta['datetime'] = pd.to_datetime(mbta['datestring'])

In [None]:
mbta

In [None]:
mbta.dtypes

In [None]:
mbta = mbta[['datetime', 'stop_id', 'station_name', 'route_or_line', 'gated_entries']]

In [None]:
mbta

In [None]:
mbta.set_index('datetime', inplace=True)

In [None]:
mbta

In [None]:
mbta[mbta['station_name'] == 'Davis'].plot(y='gated_entries', figsize=(18, 8), kind='line', legend=False)
plt.xlabel("Date & Time")
plt.ylabel("Gated Entries")
plt.title("Gated Entries at Davis Square")
plt.show()

In [None]:
mbta[mbta['station_name'] == 'Tufts Medical Center'].plot(y='gated_entries', figsize=(18, 8), kind='line', legend=False)
plt.xlabel("Date & Time")
plt.ylabel("Gated Entries")
plt.title("Gated Entries at Tufts Medical Center")
plt.show()

In [None]:
week = mbta['2020-01-27 00:00:00':'2020-02-02 23:59:59']

In [None]:
week

In [None]:
week[week['station_name'] == 'Davis'].plot(y='gated_entries', figsize=(18, 8), kind='line', legend=False)
plt.xlabel("Date & Time")
plt.ylabel("Gated Entries")
plt.title("Gated Entries at Davis Square")
plt.show()

In [None]:
week[week['station_name'] == 'Tufts Medical Center'].plot(y='gated_entries', figsize=(18, 8), kind='line', legend=False)
plt.xlabel("Date & Time")
plt.ylabel("Gated Entries")
plt.title("Gated Entries at Tufts Medical Center")
plt.show()

In [None]:
day = mbta.loc['2020-01-30']

In [None]:
day

In [None]:
import matplotlib.dates as mdates

In [None]:
ax = day[day['station_name'] == 'Davis'].plot(y='gated_entries', figsize=(18, 8), kind='line', legend=False)
plt.xlabel("Date & Time")
plt.ylabel("Gated Entries")
plt.title("Gated Entries at Davis Square on Thursday, January 30, 2020")
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
plt.show()

In [None]:
ax = day[day['station_name'] == 'Tufts Medical Center'].plot(y='gated_entries', figsize=(18, 8), kind='line', legend=False)
plt.xlabel("Date & Time")
plt.ylabel("Gated Entries")
plt.title("Gated Entries at Tufts Medical Center on Thursday, January 30, 2020")
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
plt.show()

In [None]:
mbta.groupby('station_name').sum().sort_values(by='gated_entries', ascending=False)

In [None]:
mbta.groupby('route_or_line').sum().sort_values(by='gated_entries', ascending=False)

In [None]:
(mbta.groupby('route_or_line')
     .sum()
     .sort_values(by='gated_entries', ascending=False)
     .plot(kind = 'bar', legend = False))
plt.xlabel("Line")
plt.ylabel("Total Gated Entries")
plt.title("Total Gated Entries by MBTA Line in Q1 2020")
plt.ticklabel_format(axis='y', style='plain')
plt.show()

In [None]:
mbta.resample('D').sum().sort_values(by='gated_entries', ascending=False)
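
`groupby()` and `resample()` both split the data into groups and aggregate each group. The difference is that `groupby()` groups by the values in a column, while `resample()` groups by time intervals of the datetime index. Here is a sketch on a small made-up table; the stations and counts are hypothetical.

```python
import pandas as pd

# hypothetical entries, indexed by timestamp like the mbta data frame above
entries = pd.DataFrame(
    {'station_name': ['Davis', 'Porter', 'Davis', 'Porter'],
     'gated_entries': [10, 5, 20, 15]},
    index=pd.to_datetime(['2020-01-30 08:00', '2020-01-30 08:00',
                          '2020-01-30 17:00', '2020-01-31 08:00']),
)
entries.index.name = 'datetime'

# one total per station
by_station = entries.groupby('station_name')['gated_entries'].sum()
print(by_station)

# one total per calendar day ('D'), based on the datetime index
by_day = entries['gated_entries'].resample('D').sum()
print(by_day)
```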

In [None]:
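# bar plots put each bar at an integer position, so a date formatter would mislabel the ticks;
# format the dates as strings before plotting instead
daily = mbta.resample('D').sum()
daily.index = daily.index.strftime('%Y-%m-%d')
daily.plot(kind = 'bar', legend = False, figsize=(18, 8))
plt.xlabel("Date")
plt.ylabel("Total Gated Entries")
plt.title("MBTA Daily Gated Entries")
plt.show()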

In [None]:
mbta.groupby('datetime').sum()

In [None]:
mbta.groupby('datetime').sum().plot(kind = 'line', legend = False, figsize=(18, 8))
plt.xlabel("Date")
plt.ylabel("Gated Entries")
plt.title("MBTA Gated Entries")
plt.show()

In [None]:
(mbta.groupby('datetime').sum()['2020-01-27 00:00:00':'2020-02-02 23:59:59']
     .plot(kind = 'line', legend = False, figsize=(18, 8)))
plt.xlabel("Date")
plt.ylabel("Gated Entries")
plt.title("MBTA Gated Entries")
plt.show()

In [None]:
(mbta.groupby('datetime').sum().loc['2020-01-30']
          .plot(kind = 'line', legend = False, figsize=(18, 8)))
plt.xlabel("Date")
plt.ylabel("Gated Entries")
plt.title("MBTA Gated Entries on Thursday, January 30, 2020")
plt.show()

In [None]:
mbta.groupby(['stop_id', 'route_or_line', 'station_name']).resample('D').sum()

In [None]:
mbta_daily = (mbta.groupby(['stop_id', 'route_or_line', 'station_name'])
                  .resample('D')
                  .sum()
                  .reset_index())

In [None]:
mbta_daily

In [None]:
davis_daily = mbta_daily[mbta_daily['station_name'] == 'Davis'].copy()
# bar plots put each bar at an integer position, so format the dates as strings up front
davis_daily['datetime'] = davis_daily['datetime'].dt.strftime('%Y-%m-%d')
davis_daily.plot(x='datetime',
                 y='gated_entries',
                 figsize=(18, 8),
                 kind='bar',
                 legend=False)
plt.xlabel("Date")
plt.ylabel("Gated Entries")
plt.title("Gated Entries at Davis Square")
plt.show()

In [None]:
mbta_daily[mbta_daily['station_name'] == 'Davis'].max()

In [None]:
mbta_daily.groupby('station_name').mean(numeric_only=True).sort_values(by='gated_entries', ascending=False)

In [None]:
mbta_daily.groupby('route_or_line').mean(numeric_only=True).sort_values(by='gated_entries', ascending=False)

In [None]:
(mbta_daily.groupby('route_or_line')
           .mean(numeric_only=True)
           .sort_values(by='gated_entries', ascending=False)
           .plot(kind = 'bar', legend = False))
plt.xlabel("Line")
plt.ylabel("Average Daily Gated Entries")
plt.title("Average Daily Gated Entries by MBTA Line in Q1 2020")
plt.ticklabel_format(axis='y', style='plain')
plt.show()

In [None]:
mbta_weekday = mbta[mbta.index.weekday < 5].copy().reset_index()

In [None]:
mbta_weekday['time'] = mbta_weekday['datetime'].dt.time

In [None]:
mbta_weekday = mbta_weekday.groupby(['stop_id', 'station_name', 'route_or_line', 'time']).mean().reset_index()

In [None]:
mbta_weekday[mbta_weekday['station_name'] == 'Davis'].plot(x='time',
                                                           y='gated_entries', 
                                                           figsize=(18, 8), 
                                                           kind='line', 
                                                           legend=False)
plt.xlabel("Time")
plt.ylabel("Gated Entries")
plt.title("Gated Entries at Davis Square on an Average Weekday in Q1 2020")
plt.show()

In [None]:
mbta_weekday[mbta_weekday['station_name'] == 'Tufts Medical Center'].plot(x='time',
                                                           y='gated_entries', 
                                                           figsize=(18, 8), 
                                                           kind='line', 
                                                           legend=False)
plt.xlabel("Time")
plt.ylabel("Gated Entries")
plt.title("Gated Entries at Tufts Medical Center on an Average Weekday in Q1 2020")
plt.show()