# Homework03

Some exercises to get started with Python: lists, dicts, and datasets.

## Goals

- Gain experience with a popular scripting language used for ML/AI projects and research
- Get familiar with Python's notation for lists and objects
- Experiment with Python's unique functionalities for processing lists and objects
- Learn to load and process datasets using Python

### Setup

Run the following 2 cells to import all necessary libraries and helpers for Homework 03.

In [None]:
!wget -q https://github.com/PSAM-5020-2025F-A/5020-utils/raw/main/src/data_utils.py

In [None]:
import matplotlib.pyplot as plt

from Homework03_utils import Tests
from data_utils import object_from_json_url

### Exercise 01:

Working with data files.

Find the name and population of the 3 cities that are geographically closest to the world's most populous city.

# 🤔😱


#### Load Data:

Let's break this down into a few sub-problems.

First, let's load a JSON file that has information about large cities in the world.

The file at this [URL](https://raw.githubusercontent.com/PSAM-5020-2025S-A/5020-utils/main/datasets/json/cities50k.json) has a list of cities formatted like this:

```py
{
  "name": "Pittsburgh",
  "country": "US",
  "admin1": "Pennsylvania",
  "lat": 40.4406200,
  "lon": -79.9958900,
  "pop": 304391
}
```
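
Once parsed, each record like the one above becomes a Python `dict`, and the whole file becomes a `list` of them. A minimal sketch using the standard `json` module on an inline string (this is not the course's `object_from_json_url()` helper, and the second record is a made-up placeholder):

```python
import json

# Inline sample standing in for the downloaded file.
# The second record is a made-up placeholder, not real data.
sample_text = """
[
  {"name": "Pittsburgh", "country": "US", "admin1": "Pennsylvania",
   "lat": 40.44062, "lon": -79.99589, "pop": 304391},
  {"name": "Exampleville", "country": "XX", "admin1": "Nowhere",
   "lat": 0.0, "lon": 0.0, "pop": 1}
]
"""

sample_cities = json.loads(sample_text)

print(type(sample_cities))       # a list
print(type(sample_cities[0]))    # each element is a dict
print(sample_cities[0]["name"])  # values are read by key
```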

This is just like how we loaded ANSUR data files in class:

In [None]:
# Define the location of the json file here
CITIES_FILE = "https://raw.githubusercontent.com/PSAM-5020-2025F-A/5020-utils/main/datasets/json/cities50k.json"

# Use the object_from_json_url() function to load contents from 
# the json file into a Python object called "info_cities"

info_cities = object_from_json_url(CITIES_FILE)

#### Exercise 01A:

Ok. We should now have a list of objects with information about cities.

Explore the data and answer the following questions:
- How many cities are in this list?
- What's the name of the first city on the list?
- What are the latitude and longitude of the last city on the list?
- What are the populations for the largest and smallest cities?
- What's the name of the city with the largest population?


In [None]:
# Work on 01A here

# How many cities are in the list?

# len() calculates the length of the list. As each row represents a different city, the len will return the total number of cities
num_cities = len(info_cities)
print("cities in the list: ", num_cities)

# What's the name of the first city on the list?

# dict[key] returns the value for a specific key in a dictionary.
# Combined with list[index], it will be restricted to that element (dictionary) of the list
first_city = info_cities[0]["name"]
print("name of first city: ", first_city)

# What are the latitude and longitude of the last city on the list?

last_latitude = info_cities[-1]["lat"]
last_longitude = info_cities[-1]["lon"]
print("latitude of last city: ", last_latitude)
print("longitude of last city: ", last_longitude)

# What are the populations for the largest and smallest cities?

# defining the key-function popKey, which receives a city dictionary and returns only the population value in order to compare them.
# This will be useful to calculate the max and min comparing only those values for each city dictionary in the info_cities list.
# The max and min could also be calculated with a loop that iterates over every city in the list to extract their pops and append them to an array,
# from which the max and min values could be selected. But in that case, the city names would get lost in the process.
def popKey(city):
    return city["pop"]

# population for the largest city 
largest_city = max(info_cities, key=popKey)
largest_population = largest_city["pop"]

# population for the smallest city
smallest_city = min(info_cities, key=popKey) 
smallest_population = smallest_city["pop"]

print("populations for largest and smallest cities: ", largest_population, smallest_population)

# What's the name of the city with the largest population?

largest_city_name = largest_city["name"]

print("city with largest population: ", largest_city_name)


#### Test 01A

In [None]:
# Test 01A
answers = [num_cities, first_city, last_latitude, last_longitude, largest_population, smallest_population, largest_city_name]

Tests.test("01A", answers)

#### Exercise 01B:

We have the largest city's name and population, but we need its position.

We can recycle some of the logic from above to get the whole object that contains information for the largest city.

In [None]:
# Work on 01B here

largest_city = max(info_cities, key=popKey)

print("object with complete information for the largest city: ", largest_city)


In [None]:
# Test 01B
Tests.test("01B", largest_city)

#### Exercise 01C:

We should have all info about the largest city here.

Now, we'll iterate through the list and use each city's latitude and longitude to calculate its distance from the largest city.

Although not $100\%$ correct, it's ok to use the [2D Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance#Two_dimensions) for this.

It could be useful to define a function `distance(cityA, cityB)` that returns the distance between two cities.
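
To see why the Euclidean shortcut is only approximate: degrees of longitude shrink toward the poles, so a difference in degrees is not a fixed distance on the ground. The sketch below compares it with the haversine great-circle formula, which is a different technique from the one this exercise asks for. The function names and the 6371 km mean Earth radius are assumptions of this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def euclidean_degrees(lat_a, lon_a, lat_b, lon_b):
    """Straight-line distance in the lat/lon plane, in degrees."""
    return math.sqrt((lat_a - lat_b) ** 2 + (lon_a - lon_b) ** 2)

def haversine_km(lat_a, lon_a, lat_b, lon_b):
    """Great-circle distance on a sphere, in kilometers."""
    phi_a, phi_b = math.radians(lat_a), math.radians(lat_b)
    d_phi = phi_b - phi_a
    d_lambda = math.radians(lon_b - lon_a)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi_a) * math.cos(phi_b) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Two pairs of points, each 10 degrees of longitude apart
at_equator = (0.0, 0.0, 0.0, 10.0)
at_60_north = (60.0, 0.0, 60.0, 10.0)

# Identical in degrees...
print(euclidean_degrees(*at_equator), euclidean_degrees(*at_60_north))
# ...but on the ground the pair at 60 degrees north is roughly half as far apart
print(haversine_km(*at_equator), haversine_km(*at_60_north))
```

For ranking the nearest neighbors of a single city, the Euclidean version is usually close enough, which is why the exercise allows it.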

In [None]:
# Work on 01C here

# Implement the helper function for calculating distances between 2 cities

# importing math to use math.sqrt 
import math

def distance(cityA, cityB):
    latA = cityA["lat"]
    lonA = cityA["lon"]    
    latB = cityB["lat"]
    lonB = cityB["lon"]
    latdiff = latA - latB
    londiff = lonA - lonB

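    # 2D Euclidean distance in degrees; returned directly so that no local
    # variable shadows the name of the function itself
    return math.sqrt((latdiff ** 2) + (londiff ** 2))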


In [None]:
# Test 01C
Tests.test("01C", distance)

#### Exercise 01D:

Ok. We implemented a function to calculate the distance between 2 cities; let's use it now.

Iterate through the list of cities again, calculate the distance from each city to the largest city, and add that as a new feature/key to each city's entry:

```py
{
  "name": "Pittsburgh",
  "country": "US",
  "admin1": "Pennsylvania",
  "lat": 40.4406200,
  "lon": -79.9958900,
  "pop": 304391,
  "distance": 1222.32
}
```

Just make sure the key that holds the distance value is called `distance`.

In [None]:
# Work on 01D here

# Now calculate every city's distance from the largest city and
# add that info to each city's entry or save that on a new list
# with their name and pop

# defining a new empty list for storing each city's name, pop, and distance to the largest city
city_distances = []
# looping through every city
for city in info_cities:
    # adding distance to largest city as a new key-value pair in the info_cities list
    city['distance'] = distance(city, largest_city)
    # also appending these elements to the list city_distances (which will be a list of objects)
    city_distances.append({"name" : city['name'], "pop" : city['pop'], "distance" : city['distance']})


In [None]:
# Test 01D
Tests.test("01D", info_cities)

#### Exercise 01E:

Now, sort the array from the previous step by distance and get the name and population of the $3$ cities closest to the largest city, but not including the largest city. In other words, if you sort the list from the exercise above by ascending `distance`, the $3$ cities closest to the largest city will be in the slice `[1:4]`. The city at index $0$ is the city with the largest population, and should have a distance of $0$ from itself.

The answer should be an object whose keys are city names and whose values are populations.

Something like:

```python
closest_3 = {
  "pittsburgh": 23412,
  "liverpool": 172821,
  "oakland": 182726
}
```

We saw how to sort lists of objects in lecture.
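
As a reminder of the pattern, here is a small sketch on made-up records (the names and numbers are placeholders, not values from the dataset): `sorted()` with a key function orders the dicts, a slice skips the entry at distance 0, and a dict comprehension builds the name-to-population mapping.

```python
# Made-up records for illustration only
toy = [
    {"name": "c", "pop": 30, "distance": 2.5},
    {"name": "origin", "pop": 99, "distance": 0.0},
    {"name": "a", "pop": 10, "distance": 0.5},
    {"name": "d", "pop": 40, "distance": 7.0},
    {"name": "b", "pop": 20, "distance": 1.0},
]

def by_distance(record):
    return record["distance"]

# sorted() is ascending by default and returns a new list; toy is unchanged
toy_sorted = sorted(toy, key=by_distance)

# Index 0 is the reference point itself, so take indices 1, 2 and 3
nearest = {r["name"]: r["pop"] for r in toy_sorted[1:4]}
print(nearest)
```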

In [None]:
# Work on 01E here

# Sort the array and get the name and population of the 3 cities closest to the largest city

# creating a key-function to sort the list of objects city_distances by distance
def distanceKey(city):
    return city["distance"]

# sorting the list in ascending order
cities_by_distance = sorted(city_distances, key=distanceKey)

# index 0 is the largest city itself (distance 0), so the 3 closest cities are in [1:4]
closest_3 = {city["name"]: city["pop"] for city in cities_by_distance[1:4]}
print(closest_3)


In [None]:
# Test 01E
Tests.test("01E", closest_3)

### Exercise 02:

Visualizing data files.


#### Loading The Data:

Let's load a JSON file that has information about houses in the Los Angeles metropolitan region of California.

The file at this [url](https://raw.githubusercontent.com/PSAM-5020-2025S-A/5020-utils/main/datasets/json/LA_housing.json) has a list of objects formatted like this:

```python
{
  "longitude": -114.310,
  "latitude": 34.190,
  "age": 15,
  "rooms": 12.234,
  "bedrooms": 3.514,
  "value": 669000
}
```

The number of rooms and bedrooms are not integers because some addresses have multiple units/apartments with different floorplans that get averaged.

In [None]:
# Define the location of the json file
HOUSES_FILE = "https://raw.githubusercontent.com/PSAM-5020-2025F-A/5020-utils/main/datasets/json/LA_housing.json"

# Use the object_from_json_url() function to load
# the json file into a Python object called "info_houses"

info_houses = object_from_json_url(HOUSES_FILE)

#### Exercise 02A:

Explore the data and answer the following questions:
- How many instances are there in our dataset?
- What's the value of the most expensive house?
- What's the max number of bedrooms in a house?
- What's the number of bedrooms in the house with the most rooms?
- What's the number of rooms in the house with the most bedrooms?
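
The last two questions ask for one feature of the record that maximizes a different feature. One way to approach that, sketched on made-up records (placeholder numbers, not values from the dataset), is to let `max()` return the whole record and then read the other key from it:

```python
# Made-up records for illustration only
toy_houses = [
    {"rooms": 5.0, "bedrooms": 2.0, "value": 100},
    {"rooms": 9.0, "bedrooms": 3.0, "value": 300},
    {"rooms": 7.0, "bedrooms": 4.0, "value": 200},
]

# max() with a key returns the whole dict, not just the compared number
house_with_most_rooms = max(toy_houses, key=lambda h: h["rooms"])
house_with_most_bedrooms = max(toy_houses, key=lambda h: h["bedrooms"])

print(house_with_most_rooms["bedrooms"])   # bedrooms in the house with the most rooms
print(house_with_most_bedrooms["rooms"])   # rooms in the house with the most bedrooms
```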

In [None]:
# Work on 02A here

# How many instances are there in our dataset?
# This is the same as asking "how many rows" or, in this case, "how many houses"

num_houses = 0


# What's the value of the most expensive house?

max_value = 0


# What's the number of bedrooms in the house with the most bedrooms?

most_bedrooms = 0


# What's the number of bedrooms in the house with the most rooms?

bedrooms_in_most_rooms = 0


# What's the number of rooms in the house with the most bedrooms?

rooms_in_most_bedrooms = 0


In [None]:
# Test 02A
answers = [num_houses, max_value, most_bedrooms, bedrooms_in_most_rooms, rooms_in_most_bedrooms]

Tests.test("02A", answers)

#### Exercise 02B:

Which of the features (`longitude`, `latitude`, `age`, `rooms` or `bedrooms`) is the best indicator of a house's value?

We're going to use XY scatter plots to visualize house value as a function of each of these features, and see if any of them show strong correlation.

Documentation for the plotting library is here:
https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html

One thing to note is that the plotting functions expect lists of values, not lists of objects.

Before we plot anything, let's define a function `list_from_key(objs, key)` that returns a list with all values of feature `key` for all of the objects in `objs`.
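
One possible shape for such a helper is a list comprehension. This is a sketch on made-up records, with a hypothetical name `values_for_key` to keep it separate from the graded function:

```python
def values_for_key(objs, key):
    """Return the value stored under `key` for every dict in `objs`, in order."""
    return [obj[key] for obj in objs]

# Made-up records for illustration only
toy_records = [
    {"age": 15, "value": 100},
    {"age": 30, "value": 250},
    {"age": 8, "value": 175},
]

print(values_for_key(toy_records, "age"))
print(values_for_key(toy_records, "value"))
```

Because the output keeps the input order, two lists built from the same records line up index by index, which is what a scatter plot needs for its x and y values.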

In [None]:
# Work on 02B here

# helper function to get lists of values from specific key
def list_from_key(objs, key):
  # TODO: implement the list_from_key functionality
  return []


In [None]:
# Test 02B
Tests.test("02B", list_from_key)

#### Exercise 02C:

Now we can actually plot some values and start looking for correlations.

Pick a feature and make a graph that shows house prices as a function of that feature.

You can also write a for loop to plot graphs for all features.
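
A scatter plot shows correlation visually. To put a rough number on it, one option is the Pearson correlation coefficient, which is near $+1$ or $-1$ for a strong linear relationship and near $0$ for none. This goes slightly beyond what the exercise asks for. Below is a pure-Python sketch on made-up numbers, with `pearson_r` as a hypothetical helper name:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up numbers for illustration only
xs = [1, 2, 3, 4, 5]
rising = [10, 20, 30, 40, 50]    # increases steadily with xs
falling = [50, 40, 30, 20, 10]   # decreases steadily with xs

print(pearson_r(xs, rising))    # close to +1
print(pearson_r(xs, falling))   # close to -1
```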

In [None]:
# Work on 02C here

# TODO: get a list with all of the price values
prices = []

# TODO: get a list with all of the house ages (for example)
house_ages = []


# this is the command to plot an XY scatter plot from 2 lists
# see documentation: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html

plt.scatter(house_ages, prices)
plt.xlabel("age")
plt.ylabel("value")
plt.show()

#### What are some features that correlate with price?

### <span style="color:hotpink">Interpretation</span>

<span style="color:hotpink">Answer 02C here. Double-click to edit Markdown.</span>

#### Foreshadowing:

What if we use two features at a time?

Is there a pair of features (`longitude`, `latitude`, `age`, `rooms` or `bedrooms`) that correlates with house value?

We could look at the relationship between `value`, `age` and `rooms`:

In [None]:
# get a list with all of the price values
prices = list_from_key(info_houses, "value")

# get a list with all of the values of one feature
feature_0_values = list_from_key(info_houses, "age")

# get a list with all of the values of another feature
feature_1_values = list_from_key(info_houses, "rooms")

# this is how we plot an XY scatter plot using 3 lists
# see documentation: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html

plt.scatter(feature_0_values, feature_1_values, c=prices, alpha=0.3)
plt.xlabel("age")
plt.ylabel("rooms")
plt.show()

Or, we could write a little for loop to plot all possible pairs of features:

In [None]:
# to plot all feature pairs
# get list of all features
features = info_houses[0].keys()
prices = list_from_key(info_houses, "value")

# get all pairs of features
for idx_0, feature_0 in enumerate(features):
  x = list_from_key(info_houses, feature_0)
  for idx_1, feature_1 in enumerate(features):
    y = list_from_key(info_houses, feature_1)
    # skip repeated features
    if feature_0 != "value" and feature_1 != "value" and idx_1 > idx_0:
      plt.scatter(x, y, c=prices, alpha=0.3)
      plt.xlabel(feature_0)
      plt.ylabel(feature_1)
      plt.show()
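
The index check in the loop above visits each unordered pair of features once. An alternative sketch with `itertools.combinations` produces the same pairs directly. The feature list is written out by hand here as an assumption, based on the record format shown earlier:

```python
from itertools import combinations

# Assumed feature names, copied from the record format (excluding "value")
feature_names = ["longitude", "latitude", "age", "rooms", "bedrooms"]

# combinations(..., 2) yields each unordered pair exactly once, in input order
feature_pairs = list(combinations(feature_names, 2))

for feature_x, feature_y in feature_pairs:
    print(feature_x, "vs", feature_y)

print(len(feature_pairs), "pairs")
```

With 5 features this gives 10 pairs, so the plotting loop would draw 10 scatter plots.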