<a href="https://colab.research.google.com/github/natrask/ENM1050/blob/main/Code%20examples/Lecture_18.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Lecture 17 - Pandas and a cookbook for a data-gathering/classification pipeline** #

# CONTRIBUTORS #

This in-class exercise is to be done in pairs. Add the names of the two students in this text block.


# Overview of today #

Today we are going to be processing *geospatial data*: data that describes the position of a sensor on the globe. To do that, we will need a library that doesn't come installed by default in Colab, called *geopandas*. Pandas and geopandas are both libraries for handling data easily, and they act as a simple layer between the data and PyTorch that will make our lives easier.

First, we will tell Colab to install geopandas and contextily. This will take a minute.

In [None]:
# First, here is a list of libraries that we'll need.

!pip install geopandas contextily

import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import contextily as ctx

Next we will see an example of what kind of processing geopandas allows us to do.

A pandas dataframe (pd.DataFrame) holds different types of data. In the short example below, it holds data corresponding to address, latitude, and longitude for two different addresses.

By wrapping it in a GeoDataFrame, we specify which parts of the data correspond to a geometric point and which coordinate system those points use (in this case EPSG:4326, which stores positions as longitude and latitude in degrees). We use this to render the two points on a map of Philadelphia.

In [None]:
# Example data: two street addresses with their latitude and longitude
# (replace with your own data, e.g. a CSV with 'address', 'latitude', 'longitude' columns)
data = pd.DataFrame({'address': ['123 Market St, Philadelphia, PA', '456 Walnut St, Philadelphia, PA'],
                     'latitude': [39.9526, 39.9500],
                     'longitude': [-75.1652, -75.1452]})

# Create a GeoDataFrame from your data
gdf = gpd.GeoDataFrame(
    data, geometry=gpd.points_from_xy(data.longitude, data.latitude))
gdf.crs = 'EPSG:4326'  # Set the coordinate system to WGS 84

# Plot the map
ax = gdf.plot(figsize=(10, 10), markersize=50, color='red')

# Add a basemap
ctx.add_basemap(ax, crs=gdf.crs, source=ctx.providers.OpenStreetMap.Mapnik)

# Customize plot (optional)
plt.title('Street Addresses in Philadelphia')
plt.xlabel('Longitude')
plt.ylabel('Latitude')

plt.show()
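
To make the idea of a coordinate system more concrete, here is a minimal sketch of how a longitude/latitude pair in EPSG:4326 (degrees) maps to EPSG:3857 (Web Mercator, in meters), the projection that web map tiles such as OpenStreetMap are normally drawn in. The function name `lonlat_to_web_mercator` is made up for this illustration; in practice geopandas can do the conversion with `gdf.to_crs(epsg=3857)`.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 equatorial radius, used by Web Mercator

def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Convert EPSG:4326 degrees to EPSG:3857 meters (spherical Web Mercator)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

# The two example points from the cell above
for lon, lat in [(-75.1652, 39.9526), (-75.1452, 39.9500)]:
    x, y = lonlat_to_web_mercator(lon, lat)
    print(f"lon={lon}, lat={lat} -> x={x:.1f} m, y={y:.1f} m")
```

The same physical location has very different numbers in the two systems, which is why the coordinate system has to be stated explicitly before plotting on a basemap.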

In general, geospatial data can come from many sources. As we transition to working with robots and sensors in the remainder of the course, we will see that it is easy to extract latitude and longitude with a small sensor. This can be used to assist in robotic navigation, or track the location of a passive sensor (like a weather balloon) as it drifts around in the atmosphere. If other information is gathered at the same site (for example, temperature/humidity for a weather sensor), then we can attempt to use machine learning to associate measurements with either classification or prediction.

**Today's goal:** we will build up a model of the Penn campus that aims to identify what parts of campus are primarily residential vs academic.

# Introducing Pandas #

Before we do that, I'm going to introduce Pandas. Pandas makes it easy to explore very large datasets, and many datasets that you receive in practice will come prepackaged as a pandas dataframe. In this first example, we will generate a dataframe describing four people and some demographics about them.

In [None]:
import pandas as pd

# Creating a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Dennis'],
        'Age': [25, 30, 22, 18],
        'City': ['New York', 'London', 'Paris', 'Philadelphia'],
        'Income': [96000,55327,101101,42000]}
df = pd.DataFrame(data)

# Displaying the DataFrame
print(df)

The following functions show how pandas lets us easily manipulate the data, gather subsets of the data, and summarize trends in the dataset.

In [None]:

# Accessing columns
print('The following are the names in the dataset')
print(df['Name'])  # Accessing the 'Name' column
print('\n')

# Accessing rows by index
print('This is the first entry in the dataset')
print(df.loc[0])  # Accessing the first row
print('\n')

# Filtering data
print('The following lists entries only over the age of 25:')
print(df[df['Age'] > 25])  # Filtering rows where 'Age' is greater than 25
print('\n')

# Adding a new column
print('We can add a new column - in this case the country for each')
df['Country'] = ['USA', 'UK', 'France','USA']
print(df)
print('\n')

# It is easy to grab simple properties
print('The average age by country is:')
print(df.groupby('Country')['Age'].mean())  # Grouping by 'Country' and calculating the mean of 'Age'

# Some other useful methods:
# print(df.describe())  # Summary statistics of numerical columns
# print(df.head(2))  # Displaying the first 2 rows
# print(df.tail(2))  # Displaying the last 2 rows
# print(df.sort_values(by='Age'))  # Sorting by 'Age'



Remember to use Gemini when you are learning a new library! It won't be able to do your homework for you, but you can ask it to explain any one of these functions.

**Your turn:** Ask Gemini to explain `df.groupby('Country')['Age']`. Paste your explanation in the block below.

*Put stuff here.*

# Today's exercise - a complete data collection to ML pipeline #

In the last lecture we discussed *decision boundaries*. These are geometric boundaries that a classification model uses to assign each datapoint to a class. For example, if you wanted to classify a child as a newborn or a toddler, you could use the child's weight and height as clear indicators. We saw this last lecture when we used flower stems and petals to identify flower species.
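
As a toy illustration of a decision boundary, the sketch below classifies a child from weight and height using a straight line in the (weight, height) plane. The coefficients and the threshold are invented purely for illustration and are not fitted to any real data; a trained model would learn them instead.

```python
# Hypothetical rule: the coefficients and threshold are made up for illustration,
# not fitted to real data.
def classify_child(weight_kg, height_cm):
    """Return a class depending on which side of a straight-line boundary the point falls."""
    score = 1.0 * weight_kg + 0.1 * height_cm  # linear combination of the two features
    return 'toddler' if score > 14.0 else 'newborn'

label_small = classify_child(3.5, 50)    # small and light
label_large = classify_child(12.0, 85)   # larger and heavier
print(label_small, label_large)
```

The decision boundary here is the line where `score` equals the threshold; points on one side get one label and points on the other side get the other.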

Today we will develop a system that will split the Penn campus up into residential and non-residential areas. To do this, we will generate a list of latitude/longitude locations of buildings and label them as either a *dorm* or an *academic building* (meaning a lecture hall, laboratory, etc).

We will do this by constructing a Google Form. The reason is that in your next HW assignment you will build a survey to collect data for a classification analysis, and you will be able to reuse this exercise with minor changes.

**Your turn.**
1. Open maps.google.com and search for the UPenn campus.
2. Choose a random dorm on campus (doesn't need to be yours). Right-click it on the Google map to get its latitude and longitude.
3. Open up the following survey: https://forms.gle/TZ5gedEEHYo9wGmB6
4. Enter the latitude and longitude, and mark it as a *dorm*.
5. Repeat the process for a random academic building, and mark it as *not dorm*.

# Process dataset into a pandas dataframe #

Below, we will load the output of the survey from the corresponding Google Sheet and put it into a pandas dataframe.

In [None]:
from google.colab import auth
auth.authenticate_user()

import gspread
from google.auth import default
creds, _ = default()

gc = gspread.authorize(creds)

In [None]:
spreadsheet_url = 'https://docs.google.com/spreadsheets/d/1PNweaNkwxdmCo84-1duahGVg1uelcsSS0ZaL45EVIiY/edit?gid=1824420767#gid=1824420767'
sh = gc.open_by_url(spreadsheet_url)  # or gc.open_by_key(spreadsheet_key)
worksheet = sh.get_worksheet(0)

Early in the semester, we showed how to loop over and process data like this using lists. For example:

In [None]:
# Get all values from the worksheet as a list of lists
data = worksheet.get_all_values()
for entry in range(1,len(data)):
  lat_entry = data[entry][1]
  lon_entry = data[entry][2]
  if data[entry][3] == 'Dorm':
    label_entry = 1
  else:
    label_entry = 0

  print(lat_entry,lon_entry,label_entry)

To put it in a dataframe instead, we first define a list of strings naming the columns, and then loop through the data, adding new entries to the dataframe one row at a time.

In [None]:
import pandas as pd

# Create an empty DataFrame with specified column names
columns = ['latitude', 'longitude', 'survey_label']
df = pd.DataFrame(columns=columns)

# Get all values from the worksheet as a list of lists
data = worksheet.get_all_values()
for entry in range(1,len(data)):
  lat_entry = float(data[entry][1])
  lon_entry = float(data[entry][2])
  if data[entry][3] == 'Dorm':
    label_entry = 1
  else:
    label_entry = 0

  df.loc[len(df)] = [lat_entry, lon_entry, label_entry]
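
Growing a dataframe one row at a time is easy to read but slow for large datasets. As an alternative sketch, the dataframe can be built in a single call from lists. The `data` list below is a made-up stand-in for `worksheet.get_all_values()`; the column order (timestamp, latitude, longitude, label) and the values are assumptions chosen to match the indices used in the loop above.

```python
import pandas as pd

# Stand-in for worksheet.get_all_values(): a header row followed by string entries.
# The column layout and the values are assumed for illustration.
data = [['Timestamp', 'Latitude', 'Longitude', 'Type'],
        ['t1', '39.9522', '-75.1932', 'Dorm'],
        ['t2', '39.9510', '-75.1910', 'Not dorm']]

rows = data[1:]  # skip the header row
df = pd.DataFrame({
    'latitude': [float(r[1]) for r in rows],
    'longitude': [float(r[2]) for r in rows],
    'survey_label': [1 if r[3] == 'Dorm' else 0 for r in rows],
})
print(df)
print(df.dtypes)  # numeric columns, ready to convert to tensors later
```

Building the columns up front also guarantees that each column has a numeric type, which matters when the values are later handed to PyTorch.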

The following shows how easy it is now to pull out arrays corresponding to different pairs of values.

In [None]:
print('We can pull out pairs of latitude and longitude by listing column names in square brackets')
print(df[['latitude','longitude']])
print('\n')

print('We can pull these out in numpy format by appending .values at the end.')
print(df[['latitude','longitude']].values)
print('\n')

**Your turn.** Modify the code in the above block to print out a numpy array corresponding to:
1. just latitudes
2. latitudes and survey_label
3. latitudes, longitudes, and survey_label

In [None]:
# Put stuff here

## PyTorch cookbook step 1 - Load data ##

First we will visualize the data. This is where we need to use gpd (geopandas). Don't worry about the details of this, although you can click the Gemini button in the top right of Colab and ask Gemini to "explain what a geodataframe is" to learn more. For now, you can take this as a visualization showing where we have collected data.

In [None]:
# prompt: superimpose a scatter plot of street addresses over a map of philadelphia

# !pip install geopandas contextily

import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import contextily as ctx

# Use the survey dataframe built above, which has 'latitude', 'longitude',
# and 'survey_label' columns
data = df

# Create a GeoDataFrame from your data
gdf = gpd.GeoDataFrame(
    data, geometry=gpd.points_from_xy(data.longitude, data.latitude))
gdf.crs = 'EPSG:4326'  # Set the coordinate system to WGS 84

# Plot the map
# ax = gdf.plot(figsize=(10, 10), markersize=50, color='red')
ax = gdf.plot(figsize=(10, 10), markersize=50, column='survey_label', cmap = 'Spectral', legend=False)

# Add a basemap
ctx.add_basemap(ax, crs=gdf.crs, source=ctx.providers.OpenStreetMap.Mapnik)

# Customize plot (optional)
plt.title('Charting out buildings around Penn')
plt.xlabel('Longitude')
plt.ylabel('Latitude')

plt.show()

Next we need to put this data into a PyTorch tensor. You'll see here what's so nice about pandas: we can easily grab different inputs/outputs from the dataframe and drop them in.

We will want to find a model that predicts *survey_label* from *latitude* and *longitude*. That means that X_in (the input to the model) should be a size [Ndata,2] tensor, and we should process the survey_labels into a size [Ndata,2] one-hot tensor.

As mentioned last class, we always want to rescale our data so that the inputs to the neural network lie between 0 and 1; otherwise gradient descent won't work well.

In [None]:
# prompt: load latitude and longitude into pytorch tensor
import numpy as np
import torch

# Assuming your DataFrame 'df' has 'latitude' and 'longitude' columns
X_in = torch.tensor(df[['latitude', 'longitude']].values,dtype=torch.float32)
y_data = torch.tensor(df['survey_label'].values)

# Rescale X_in to [0,1]
datamin = X_in.min(dim=0)[0]
datamax = X_in.max(dim=0)[0]
X_in = (X_in - datamin) / (datamax - datamin)

#convert to one-hot encoding, considering we have 2 classes
y_data_onehot = torch.zeros(y_data.shape[0],2)
y_data_onehot[torch.arange(y_data.shape[0]),y_data.long()] = 1

print("Latitude Tensor:", X_in[:,0])
print("Longitude Tensor:", X_in[:,1])
print('One hot of classes:', y_data_onehot)
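
The two transforms above can be checked on a tiny example. The sketch below uses made-up coordinates and labels to show that min-max rescaling can be undone exactly (we will need the inverse later to plot on the map), and that `torch.nn.functional.one_hot` gives the same kind of one-hot tensor as the manual indexing above. The variable names are chosen for this example only.

```python
import torch
import torch.nn.functional as F

# Made-up coordinates and labels, only to exercise the transforms
X = torch.tensor([[39.950, -75.200], [39.955, -75.190], [39.952, -75.195]])
labels = torch.tensor([1, 0, 1])

xmin = X.min(dim=0).values
xmax = X.max(dim=0).values
X_scaled = (X - xmin) / (xmax - xmin)       # each column now spans [0, 1]
X_back = X_scaled * (xmax - xmin) + xmin    # inverse transform back to degrees

# Built-in alternative to filling a zeros tensor by index
onehot = F.one_hot(labels, num_classes=2).float()
print(X_scaled)
print(onehot)
```

Note that the rescaling is done per column, so latitude and longitude each get their own minimum and maximum.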

## Step 2 - Build PyTorch model ##

We can copy and paste from one of the classification example networks in the last class. For this example, we need to take in two inputs and output two class labels. I made a neural network with *two hidden layers* - this will be more powerful than the single hidden layer we've used so far.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

class dorm_ClassificationMLP(nn.Module):
    def __init__(self):
        super(dorm_ClassificationMLP, self).__init__()
        self.Nneurons = 10                        # **ten** internal neurons
        self.hidden = nn.Linear(2, self.Nneurons) # **two** input neurons
        self.relu = nn.ReLU()
        self.hidden2 = nn.Linear(self.Nneurons, self.Nneurons) # **ten** input to **ten** output
        self.relu2 = nn.ReLU()
        self.output = nn.Linear(self.Nneurons, 2) # **two** output neurons
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        return self.softmax(self.output(self.relu2(self.hidden2(self.relu(self.hidden(x))))))
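
Before training, it can help to confirm that the network has the shapes we expect. The following sketch rebuilds the same layer sizes with `nn.Sequential` (a shorthand chosen for this example, not the class used in the lecture), pushes a few random inputs through it, and counts the trainable parameters.

```python
import torch
import torch.nn as nn

# Same layer sizes as the class above, written with nn.Sequential for brevity
net = nn.Sequential(
    nn.Linear(2, 10), nn.ReLU(),
    nn.Linear(10, 10), nn.ReLU(),
    nn.Linear(10, 2), nn.Softmax(dim=-1),
)

x = torch.rand(5, 2)          # five made-up (latitude, longitude) pairs in [0, 1]
probs = net(x)
n_params = sum(p.numel() for p in net.parameters())

print(probs.shape)            # one row per input, one column per class
print(probs.sum(dim=1))       # softmax outputs in each row sum to 1
print(n_params)               # weights and biases across the three linear layers
```

The parameter count is (2·10 + 10) + (10·10 + 10) + (10·2 + 2) = 162, which is tiny compared to most networks but enough for a two-input problem.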

## Step 3 - Initialize model and optimizer ##

This part is standard - just copied and pasted again from last class.

In [None]:
#Initialize the model
model = dorm_ClassificationMLP()

# Define the loss function
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)
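
For reference, here is a small sketch of what `nn.CrossEntropyLoss` computes when it is given one-hot targets. The numbers are made up. The loss applies a log-softmax to its input internally, which is worth knowing because the model above already ends in `nn.Softmax`; training still runs in that setup, but the loss is usually fed raw scores (logits) instead.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, -1.0], [0.5, 1.5]])   # made-up raw scores
target = torch.tensor([[1.0, 0.0], [0.0, 1.0]])    # one-hot labels

loss_builtin = nn.CrossEntropyLoss()(logits, target)

# Same quantity by hand: log-softmax, weight by the one-hot target, average over rows
loss_manual = -(target * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

print(loss_builtin.item(), loss_manual.item())
```

The two printed values should agree, which confirms that the loss is the average negative log-probability assigned to the true class.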

## Step 4 - Train the model ##

Again - this is standard and copied from the last class examples.

In [None]:

num_epochs = 1000
for epoch in range(num_epochs):
    optimizer.zero_grad()
    y_out = model(X_in)
    loss = criterion(y_out, y_data_onehot)
    loss.backward()
    optimizer.step()

    if epoch % 100 == 0:
        print('Epoch:', epoch, 'Loss:', loss.item())

## Step 5. Post-process results. ##

First of all, we'll check a few predictions and make sure that the predicted probabilities are close to the one-hot training labels.

In [None]:
with torch.no_grad():
    y_out = model(X_in)
    for i in range(X_in.shape[0]):
        print('(predicted probability/true probability): ',y_out[i,:].detach().numpy(),y_data_onehot[i,:].detach().numpy())
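
Reading probabilities row by row gets tedious for larger datasets, so a single accuracy number is a useful summary. The sketch below uses made-up probabilities and labels to show one way to compute it; with the real model you would substitute `y_out` and `y_data_onehot`.

```python
import torch

# Made-up predicted probabilities and one-hot labels for four buildings
probs = torch.tensor([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
onehot = torch.tensor([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])

pred_class = probs.argmax(dim=1)    # most likely class for each row
true_class = onehot.argmax(dim=1)   # recover the integer label from the one-hot row
accuracy = (pred_class == true_class).float().mean().item()
print(f"accuracy = {accuracy:.2f}")
```

Keep in mind that this is accuracy on the training data, so it says how well the model fits the survey responses rather than how well it would label new buildings.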


Next, I want to visualize the decision boundary that it's using to split dorms and non-dorms apart. To do this, I'm going to copy and paste the code from last class in order to build a contour plot with the data overlaid.

In [None]:
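# Build a contour plot on a grid that extends slightly past the rescaled data
x_min, x_max = X_in[:, 0].min().item() - 0.2, X_in[:, 0].max().item() + 0.2
y_min, y_max = X_in[:, 1].min().item() - 0.2, X_in[:, 1].max().item() + 0.2

# Generate a grid of points between min and max
# (indexing='ij' matches the default behavior and avoids a warning)
xx, yy = torch.meshgrid(torch.linspace(x_min, x_max, 200), torch.linspace(y_min, y_max, 200), indexing='ij')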
X_grid = torch.stack([xx.flatten(), yy.flatten()], 1)

#calculate the model output for the grid
with torch.no_grad():
    y_out = model(X_grid)
    y_out = torch.argmax(y_out, dim=1)

# Create a contour plot
plt.contour(xx.numpy(), yy.numpy(), y_out.view(200, 200).numpy(), alpha=0.5)

# Add scatter plot of the data points (optional)
plt.scatter(X_in[:, 0].numpy(), X_in[:, 1].numpy(), c=y_data.numpy(), cmap='viridis')

# Add labels and title (optional)
plt.xlabel('Latitude (rescaled)')
plt.ylabel('Longitude (rescaled)')
plt.title('Decision Boundary for Dorm Classification')

plt.show()


In [None]:
# Rebuild the GeoDataFrame of survey points from the dataframe
data = df
gdf = gpd.GeoDataFrame(
    data, geometry=gpd.points_from_xy(data.longitude, data.latitude))
gdf.crs = 'EPSG:4326'  # Set the coordinate system to WGS 84

# Rescale the grid back into lat/long coordinates and plot
# (convert tensors to numpy arrays before handing them to pandas)
scaledX = (X_grid[:,0]*(datamax[0]-datamin[0])+datamin[0]).numpy()
scaledY = (X_grid[:,1]*(datamax[1]-datamin[1])+datamin[1]).numpy()
contour_data = pd.DataFrame({'latitude': scaledX, 'longitude': scaledY, 'prediction': y_out.numpy()})
contour_gdf = gpd.GeoDataFrame(
    contour_data, geometry=gpd.points_from_xy(contour_data.longitude, contour_data.latitude))
contour_gdf.crs = 'EPSG:4326'

# Plot the map with the contour plot superimposed
ax = contour_gdf.plot(figsize=(10,10), markersize=1, column='prediction', cmap='Spectral', alpha=0.3, legend=False)
gdf.plot(ax=ax, markersize=50, column='survey_label', cmap='Spectral', legend=False)

# Add a basemap
ctx.add_basemap(ax, crs=gdf.crs, source=ctx.providers.OpenStreetMap.Mapnik)

# Customize plot (optional)
plt.title('Charting out buildings around Penn')
plt.xlabel('Longitude')
plt.ylabel('Latitude')

plt.show()

We can also plot the decision boundary more explicitly in the unscaled coordinates so we can compare more easily to the map.

In [None]:
# Generate contours of decision boundary
scaledxx = (xx*(datamax[0]-datamin[0])+datamin[0]).numpy()
scaledyy = (yy*(datamax[1]-datamin[1])+datamin[1]).numpy()
plt.contourf(scaledxx, scaledyy, y_out.view(200, 200).numpy(), alpha=0.5)

print(data['latitude'].values)
plt.scatter(data['latitude'].values, data['longitude'].values, c=y_data.numpy(), cmap='Spectral')
plt.xlabel('Latitude')
plt.ylabel('Longitude')
plt.show()
