# Instructions

Your submission will be tested with the code tester. It is important to follow these instructions to ensure your work tests properly.

- Do not change the content of the cells under __SETUP__ and __TESTS__
- Work only in the __YOUR WORK__ area
- Rename the notebook with your group number at the end (substitute XX with your group number).
- Assign the results of each numbered question to the appropriate test variable. For example, the answer of `1.` should be assigned to `test_1`
- Rounding: use the supplied function `hround` to round decimal numbers when instructed. It's important to use this function because there are [multiple ways to round numbers in Python](https://www.knowledgehut.com/blog/programming/python-rounding-numbers) and they may not result in the same value that the tester is testing against.
- Ensure you run the cells under __SETUP__ before you run your work
- Before you submit your work, ensure you clean up your notebook. Your notebook has to run without an error in order to be tested. The easiest way to ensure this is `Kernel->Restart & Run All`
- Answers are provided along with this notebook in eLC (look for a picture named `solution_key`) for your convenience
- You will need to write a program to calculate the answers. Setting the answers to their correct values without solving them is considered *hardcoding* and will result in a grade of zero for the assignment as well as a potential academic honesty violation.
- You can also test your submission using [the online code tester](https://notebook-tester.safadi-puzzler.com/)
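
The rounding note above can be illustrated with a short sketch. It compares Python's built-in `round` with half-up rounding from the standard `decimal` module. The helper `round_half_up` is a hypothetical name introduced only for this illustration; it is not the course's `hround`, and graded answers should still go through `hround`.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value: str, places: int = 2) -> Decimal:
    """Round a decimal string half-up (hypothetical helper, not hround)."""
    quantum = Decimal(f"1e-{places}")  # Decimal('0.01') when places == 2
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)


# Built-in round() sends exact ties to the nearest even integer.
builtin_results = [round(0.5), round(1.5), round(2.5)]

# 2.675 is stored as a binary float slightly below 2.675, so it rounds down.
float_result = round(2.675, 2)

# Exact decimal arithmetic with half-up rounding gives a different answer.
half_up_result = round_half_up("2.675")

print(builtin_results, float_result, half_up_result)
```

Because the two approaches can disagree on the last digit, using one shared rounding function keeps answers comparable.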


# SETUP

In [42]:
import pandas as pd
import numpy as np
import networkx as nx
import spacy
import json, os
from textblob import TextBlob
from spacy import displacy

In [43]:
from langchain.chat_models import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain.callbacks.tracers import ConsoleCallbackHandler
from langchain.prompts import ChatPromptTemplate

In [44]:
from langchain.chains.openai_functions import create_structured_output_chain
from pydantic import BaseModel, Field

In [45]:
# DO NOT EDIT OR CHANGE THE CONTENT OF THIS CELL
scenario = 0
nlp = spacy.load('en_core_web_sm')
import nltk;nltk.download('punkt');nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/bretttracy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/bretttracy/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [46]:
def hround(number):
    return round(number, 2 - scenario)

In [47]:
test_1=test_2=test_3=test_4=test_5=test_6=test_7=test_8=test_9=test_10=0.0
test_11=test_12=test_13=test_14=test_15=0.0

In [48]:
# insert OpenAI key here...
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

In this homework, we have data from the [100 Resilient Cities](https://www.rockefellerfoundation.org/100-resilient-cities/#:~:text=Overview%20In%202013%2C%20The%20Rockefeller%20Foundation%20pioneered%20100,a%20roadmap%20to%20resilience%20along%20four%20main%20pathways%3A?msclkid=5705866aaaaa11ec85d9402890b45f8f). The data contains 30 documents from 30 cities outlining their strategies for urban resilience. We are going to focus on text and network analytics.

In [49]:
data = pd.read_json('resillient_cities.json')
data.head()

Unnamed: 0,city,country,text,ENTITIES
0,Melbourne,Australia,﻿PEOPLE ARE\nAT THE HEART OF ALL CITIES\nA res...,"{'LOCATION': ['Collingwood', 'Sri Lankan', 'So..."
1,Rio de Janeiro,Brazil,﻿100 Resilient Cities - Pioneered by The Rocke...,"{'ORGANIZATION': ['IPCC', 'PARTNERS World Bank..."
2,Medelin,Colombia,﻿A CITY THAT TRANSFORMS ITSELF FOR ITS PEOPLE\...,"{'LOCATION': ['Colombia', 'Proantioquia', 'l.A..."
3,Vejles,Denmark,﻿Letter from the Mayor\nArne Sigtenbjerggaard\...,"{'PERSON': ['Torben Christensen', 'Burgos', 'J..."
4,Quito,Ecuador,﻿MAURICIO RODAS\nMAYOR OF THE METROPOLITAN DIS...,"{'LOCATION': ['Mexico City Cuntlupo', 'Quito',..."


In [50]:
len(data)

30

## Part 1: text analytics

1. Report the unique city names in a sorted list

2. Report the unique country names in a sorted list

3. What are the five most frequent words in `text`? Return the results in a sorted list

Now we will do some text analysis. Because NLP algorithms are computationally expensive, we are going to run them on a single text (so that you and the tester save time evaluating the notebook).

4. Let us focus on the `text` of the city of `Glasgow`. Extract the text from the data frame, then perform sentiment analysis with `TextBlob`. Which sentence in this text has the lowest (most negative) polarity?

5. Using `spacy`, perform named entity detection on Glasgow's text. Now focus on organization `ORG` entities. What are the 5 most frequent organization entities? Return the results in a series sorted by that frequency.

Now, we want to use the named entities to learn more about what was discussed in these documents. To save time, I am giving you the extracted named entities in the column `ENTITIES`. The entities are organized as a dictionary where the keys are entity types and the values are the instances of these entities from the text.

6. Let us perform the same analysis as before but on all documents. Extract the `ORGANIZATION` entities of all documents from the column `ENTITIES`. Report the five most popular organization entities in a series sorted by frequency.

7. Using the llm provided to you and the prompt `Summarize the article below in one paragraph\n{article}`, create a summary of Glasgow's text.

8. We now want to extract `key people in city government and their roles` from Glasgow's text using `create_structured_output_chain`. Follow the steps introduced in preparation 39. First, what is a good description of `relevant_attributes`?

9. Define a class `Attributes(BaseModel)` to represent relevant attributes to be extracted with two required fields: `people` and `roles`. The `people` field is a list described as "person name," while the `roles` field is also a list, described as "person's role in city government." Both fields are mandatory. Return `Attributes.schema()`

10. Using the same prompt from preparation 39, extract and return the attributes (do not report the chain `input`).

In [51]:
test_1 = sorted(list(data['city'].unique()))

In [52]:
test_2 = sorted(list(data['country'].unique()))

In [53]:
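def word_freq(text):
    # Count each word in one document; TextBlob handles the tokenization
    return pd.Series(TextBlob(text).words).value_counts()

# One row per document, one column per word (NaN where a word is absent)
word_counts = data['text'].apply(word_freq)

# Column sums give corpus-wide counts; .sum() skips NaN by default
test_3 = sorted(list(word_counts.sum().sort_values(ascending=False)[:5].index))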
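
The per-document counting and summing in the cell above can also be sketched with the standard library alone. This is a minimal illustration on made-up sentences, not the graded approach: the `docs` list and the `tokenize` helper are assumptions, and unlike the cell above this sketch lowercases tokens and uses a simple regular expression instead of `TextBlob`.

```python
import re
from collections import Counter

# Toy documents standing in for data['text'] (made-up content).
docs = [
    "The city and the river",
    "Plans of the city",
    "The people of the city",
]


def tokenize(text):
    """Split text into lowercase alphabetic tokens (simplified tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


# Accumulate counts across all documents in a single Counter.
total = Counter()
for doc in docs:
    total.update(tokenize(doc))

# Take the two most frequent words, then sort them alphabetically.
top_words = sorted(word for word, _ in total.most_common(2))
print(top_words)
```

A single running `Counter` avoids building a wide documents-by-words table, which matters more as the vocabulary grows.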

In [54]:
glasgow_text = data[data['city'] == 'Glasgow']['text'].iloc[0]
blob = TextBlob(glasgow_text)
sentences = blob.sentences
most_negative_index = pd.Series(sentences).apply(lambda l: l.polarity).sort_values().index[0]
test_4 = sentences[most_negative_index].string

In [55]:
doc = nlp(glasgow_text)
ents = pd.Series(doc.ents)
ent_info = ents.apply(lambda e: (e.text,e.label_))
orgs = [e[0] for e in ent_info if e[1] == 'ORG']
test_5 = pd.Series(orgs).value_counts()[:5]

In [56]:
test_6 = data['ENTITIES'].apply(lambda t: [value for key,value in t.items() if key == 'ORGANIZATION']).explode().explode().value_counts().head(5)

In [57]:
test_6.name = None

In [58]:
test_6

Rockefeller Foundation        22
Resilience Strategy           20
The Rockefeller Foundation    20
NGOs                          18
CRF                           16
dtype: int64
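
The chained `apply`/`explode` expression above can be unpacked into explicit steps. The sketch below does the same flattening and counting with the standard library on made-up rows; `entity_rows` and its values are hypothetical stand-ins for the `ENTITIES` column.

```python
from collections import Counter
from itertools import chain

# Each row mimics one ENTITIES dictionary: entity type -> list of instances.
entity_rows = [
    {"ORGANIZATION": ["Org A", "Org B"], "LOCATION": ["Place X"]},
    {"ORGANIZATION": ["Org A"]},
    {"PERSON": ["Person Y"]},  # a row with no ORGANIZATION key
]

# Step 1: pull the ORGANIZATION list from each row (empty if the key is missing).
org_lists = [row.get("ORGANIZATION", []) for row in entity_rows]

# Step 2: flatten the list of lists and count each organization name.
org_counts = Counter(chain.from_iterable(org_lists))

# Step 3: keep the most frequent entries.
top_orgs = org_counts.most_common(1)
print(top_orgs)
```

Counting exact strings treats variants such as a name with and without a leading "The" as different entities, which is also visible in the output above.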

In [59]:
prompt = ChatPromptTemplate.from_template("Summarize the article below in one paragraph\n{article}")
output_parser = StrOutputParser()
chain = prompt | llm | output_parser
test_7 = chain.invoke({"article": glasgow_text})

In [60]:
# A good description of the relevant attributes we would like to extract from Glasgow's text is
test_8 = "key people in city government and their roles"

In [61]:
class Attributes(BaseModel):
    """Relevant attributes"""
    people: list = Field(..., description="person name")
    roles: list = Field(...,description="person's role in city government")
    
test_9 = Attributes.model_json_schema()

In [62]:
test_9

{'description': 'Relevant attributes',
 'properties': {'people': {'description': 'person name',
   'items': {},
   'title': 'People',
   'type': 'array'},
  'roles': {'description': "person's role in city government",
   'items': {},
   'title': 'Roles',
   'type': 'array'}},
 'required': ['people', 'roles'],
 'title': 'Attributes',
 'type': 'object'}
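
In the schema above, `items` is empty because both fields are annotated with a bare `list`, so the element type is unconstrained. The sketch below, assuming Pydantic v2, shows how a typed annotation changes the schema. `TypedAttributes` is a hypothetical variant for illustration only; the assignment asks for the fields exactly as specified, so this is not a replacement for the graded `Attributes` class. The sketch also uses `model_dump_json`, the Pydantic v2 method for serializing a model to JSON.

```python
import json
from typing import List

from pydantic import BaseModel, Field


class TypedAttributes(BaseModel):
    """Relevant attributes"""
    # List[str] constrains every element to be a string.
    people: List[str] = Field(..., description="person name")
    roles: List[str] = Field(..., description="person's role in city government")


schema = TypedAttributes.model_json_schema()
people_items = schema["properties"]["people"]["items"]
print(people_items)

# Serialize an instance with the Pydantic v2 method (values are made up).
example = TypedAttributes(people=["Jane Doe"], roles=["Mayor"])
as_dict = json.loads(example.model_dump_json())
print(as_dict)
```

With typed elements, a structured-output model is nudged toward plain strings for each name rather than nested objects.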

In [63]:
def create_function(chain):
    def function(input):
        answer = chain(input)
        result = json.loads(answer['function'].json())
        return result
    return function

In [64]:
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a world class algorithm for extracting information in structured formats.\nYou are helping to extract information from text."),
        ("human", f"The attributes to extract are: {test_8}"),
        ("human", "Extract these attributes from the following text: {input}"),
        ("human", "Tip: Make sure to answer in the correct format"),
    ]
)
chain = create_structured_output_chain(Attributes, llm, prompt, verbose=True)


extract_attributes = create_function(chain)

In [65]:
test_10 = extract_attributes(glasgow_text)



> Entering new LLMChain chain...
Prompt after formatting:
System: You are a world class algorithm for extracting information in structured formats.
You are helping to extract information from text.
Human: The attributes to extract are: key people in city government and their roles
Human: Extract these attributes from the following text: ﻿Glasgow is a city which learns from its past and builds on its strengths. Our
people are the backbone of our city and have always shown a willingness to
adapt, change and reinvent during our long and rich history.
As leader of the city, I am proud to present Glasgow's first resilience strategy.
This document shows us how we can adapt and grow no matter what challenges the future holds.
Glasgow has weathered so much change throughout our history from the industrial revolution and the decline of our traditional industries to the recent economic downturn. However we have always shown a flexibility and strength of character which has

/var/folders/dg/t8hys0qd0fb57hkr74r677yw0000gn/T/ipykernel_58508/639925175.py:4: PydanticDeprecatedSince20: The `json` method is deprecated; use `model_dump_json` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  result = json.loads(answer['function'].json())


# TESTS

In [66]:
### TEST 1
test_1

['Amman',
 'Athens',
 'Bangkok',
 'Berkeley',
 'Boulder',
 'Bristol',
 'Byblos',
 'Dakar',
 'El Paso',
 'Glasgow',
 'Greater Christchurch',
 'Medelin',
 'Melbourne',
 'Mexico City',
 'New Orleans',
 'Norfolk',
 'Oakland',
 'Pittsburgh',
 'Quito',
 'Ramallah',
 'Rio de Janeiro',
 'Rotterdam',
 'San Francisco',
 'Santa Fe',
 'Semarang',
 'Surat',
 'Thessaloniki',
 'Toyama',
 'Vejles',
 'Wellington']

In [67]:
## TEST 2
test_2

['Australia',
 'Brazil',
 'Colombia',
 'Denmark',
 'Ecuador',
 'Greece',
 'India',
 'Indonesia',
 'Japan',
 'Jordan',
 'Lebanon',
 'Mexico',
 'Netherlands',
 'New Zealand',
 'Palestine',
 'Senegal',
 'Thailand',
 'UK',
 'USA']

In [68]:
## TEST 3
test_3

['and', 'in', 'of', 'the', 'to']

In [69]:
## TEST 4
test_4

'There is no getting away from the fact that these are difficult times for Glasgow.'

In [70]:
## TEST 5
test_5

GEAPP                            4
RESILIENCE VALUE                 3
Platform Partners                3
The Rockefeller Foundation       2
the City Resilience Framework    2
dtype: int64

In [71]:
## TEST 6
test_6

Rockefeller Foundation        22
Resilience Strategy           20
The Rockefeller Foundation    20
NGOs                          18
CRF                           16
dtype: int64

In [72]:
## TEST 7
test_7

"Glasgow's first resilience strategy, presented by the city leader, aims to enhance the city's ability to adapt and thrive amid future challenges, building on its historical strengths and community spirit. The strategy emphasizes tackling inequalities and prioritizing the needs of disadvantaged communities, informed by extensive engagement with over 3,500 residents. It outlines a roadmap for resilience through four strategic pillars: empowering Glaswegians, unlocking place-based solutions, innovating for fair economic growth, and fostering civic participation, with specific goals and actions to be implemented over the next two years. The strategy integrates existing city initiatives and aims to create a fairer, more just Glasgow, while also addressing chronic stresses such as poverty and inequality. The city’s membership in the 100 Resilient Cities network further supports its commitment to resilience, with ongoing collaboration among various stakeholders to ensure effective implementa

In [73]:
## TEST 8
test_8

'key people in city government and their roles'

In [74]:
## TEST 9
test_9

{'description': 'Relevant attributes',
 'properties': {'people': {'description': 'person name',
   'items': {},
   'title': 'People',
   'type': 'array'},
  'roles': {'description': "person's role in city government",
   'items': {},
   'title': 'Roles',
   'type': 'array'}},
 'required': ['people', 'roles'],
 'title': 'Attributes',
 'type': 'object'}

In [75]:
## TEST 10
test_10

{'people': [{'name': 'Cllr. Frank McAveety'}, {'name': 'Alastair Brown'}],
 'roles': ['Council Leader', 'Chief Resilience Officer']}