Jonathan Soma / [js4571@columbia.edu](mailto:js4571@columbia.edu) / [@dangerscarf](https://twitter.com/dangerscarf/)

iMEdD Forum 2023

🕹️ GitHub repo can be found at [https://github.com/jsoma/2023-imedd-ai-workshop](https://github.com/jsoma/2023-imedd-ai-workshop)

💻 Run this notebook as a live programming exercise [here on Google Colab](https://colab.research.google.com/github/jsoma/2023-imedd-ai-workshop/blob/main/docs/Making%20and%20breaking%20AI%20in%20the%20newsroom.ipynb). Use `shift+enter` or the ▶️ button to run the code.

::: {.callout-note appearance="simple"}

This web page was made from a Jupyter notebook! To see how I made it, [see this tutorial about Quarto](https://github.com/jsoma/quarto-tutorial).

:::

In [None]:
#| output: false
!wget ai-report.pdf

In [None]:
#| output: false
!pip install -q --upgrade transformers openai langchain

In [None]:
from langchain.chat_models import ChatOpenAI

# You'll need your own OpenAI GPT API key! This one was mine,
# but I've deactivated it so I can publish this.
API_KEY = "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"

llm = ChatOpenAI(openai_api_key=API_KEY, model_name="gpt-3.5-turbo")

response = llm.predict("Give me a recipe for chocolate-chip cookies")
print(response)
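
If you'd rather not paste your key into the notebook, one option is to read it from an environment variable. This is a minimal sketch rather than part of the workshop code: `get_api_key` is a hypothetical helper, and it assumes you stored the key under the name `OPENAI_API_KEY` before starting the notebook.

```python
import os

def get_api_key(env_var="OPENAI_API_KEY"):
    """Read the API key from an environment variable instead of hardcoding it."""
    key = os.environ.get(env_var)
    if not key:
        # Fail early with a readable message instead of a confusing API error later
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

You could then write `API_KEY = get_api_key()` in place of the hardcoded string.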

## Headline generation

You should do this in ChatGPT! It's much better as a conversation than using a single perfect prompt to get the result you want.

## Article summaries

Sometimes you need to write little blurbs for articles: maybe for social posts, front pages, or newsletters.

Below is a NYTimes article I'm using as an example.

In [15]:
article = """
A few blocks away from where a 1-year-old boy died, possibly from fentanyl poisoning at his Bronx day care this month, an open-air drug market persists along a trash-strewn underpass.

On Wednesday, a man on a moped arrived in the late afternoon and about a dozen users ambled over to purchase drugs with dollar bills in their hands. People tied off their arms, prepped needles or packed pipes. After a while, the ground was littered with syringes and bloodstained alcohol swabs.

No police officers were seen, although they are often nearby — sometimes steps away in the Kingsbridge Road subway station. But they rarely intervene, according to local residents, elected officials who have tried to clean up the area for several years, and nonprofit workers who distribute food and clean needles at the site.

Residents and community leaders say they are regularly told by city officials that they are doing their best to make the area safer. But the Kingsbridge Road underpass is just one of many locations in the city where drug use has become more open, even as lives lost to overdoses are at a record high — roughly 3,200 such deaths citywide in 2022, according to an annual report by the city’s special narcotics prosecutor’s office.

The situation there underscores an increasingly difficult dilemma for the city: how to curb an epidemic killing thousands of New Yorkers and making neighborhoods feel unlivable for thousands of others, without reverting to aggressive crackdowns, which many leaders and public health experts said have led to civil rights abuses and did not effectively curb drug use.

The stakes are especially high with fentanyl, an extremely potent drug whose street version is becoming ever deadlier. It can kill a child if even a tiny amount is accidentally swallowed.

Areas of the city where drug use is often in plain view range from lower-income and working class areas like the Hub at 149th Street in the Bronx, to busy tourist-filled spots like Times Square and parts of the West Village. Other cities, like Portland, Ore., and Phoenix, are grappling with similar problems.

“The kids have to walk by it every day, exposed at a young age to very graphic scenes,” said Carol Rodriguez, 39, who was walking not far from the Kingsbridge underpass on her way to get her 9-year-old from school. She said things had deteriorated since the pandemic. “I worry that they grow up thinking this is normal.”

Tolerating low-level dealing also represents a broader threat beyond the public health crisis, some former law enforcement officials warn, because violence can follow, and because it makes it harder to find and prosecute those higher up the chain.

In the meantime, the question of how to avoid the collateral damage of drug use has only grown more urgent: Opioids have become the leading cause of child poisonings in the United States. More than 1,500 children died in fatal overdoses involving fentanyl in 2021, according to one study; over 100 were children under the age of 4.

Officials have not confirmed whether fentanyl was the cause of death for Nicholas Feliz Dominici, the 1-year-old who died in the Bronx on Sept. 15, but three other children from the same day care were hospitalized that day after they were exposed to fentanyl. Days after the child died, the police discovered a trap door under a play area concealing large, clear storage bags filled with narcotics. The day care’s operator and a man who lived in the apartment that housed the day care have been arrested and charged with murder and criminal drug possession.

The rising death toll comes as the city and the state have turned away from the aggressive law enforcement of low-level street drug activity that was common in the late 1990s. The shift has happened gradually over time, as a broader movement has pushed to reframe drug use as a public health crisis rather than as primarily a criminal issue.

New York City, for example, has the only city-sanctioned drug consumption sites in the nation, in East Harlem and Washington Heights, where people can use drugs under the supervision of trained workers who prevent overdoses while offering treatment to those who ask for it.

Proponents of this path say that criminalization has not worked, and has clearly not led to the elimination of drugs from our society. Rather, they say, it just pushes the problem out of view, and makes it harder for users to get help.

“The thing I refuse to do is say that the way to solve the problem is to throw more police at it,” said Gustavo Rivera, a state senator representing the Bronx, who has introduced a bill supporting decriminalization of all drugs. “We have to have a comprehensive approach.”

He added that if there was an overdose prevention center near the underpass, “you would not have those folks there. They would be in a non-stigmatized place, able to access services.”

There have been several forks in the road that have led more public officials to think the same way.

The so-called War on Drugs in the 1970s and 1980s aimed for a zero-tolerance approach. But it also led to the incarceration of millions of Black and Latino people across the country, often for nonviolent offenses. While the overall number of cocaine users declined during those years, the amount of drugs consumed stayed the same and the number of teenagers who tried illicit drugs rose, according to one study by the RAND Corporation.

In response, New York passed laws to address civil rights concerns, including one in 2019 that, among other things, significantly increased the amount of paperwork that had to be done after drug arrests, and gave prosecutors a shorter time frame to hand evidence over to the accused.

In 2021, Gov. Kathy Hochul signed a law that decriminalized the sale and possession of hypodermic needles, and also expanded the number of crimes in which those charged were eligible for diversion to drug treatment programs instead of prison. It was another signal to law enforcement that while possessing a small quantity of illegal drugs remains a crime — street use, in some ways, had essentially been decriminalized.

State bail reform laws, also passed in 2019, have allowed more people accused of crimes to return to the community shortly after their arrests. Whether this has actually increased crime is not clear, but even so, experts said the police are less likely to act aggressively if they know the people they arrest will be back on the street shortly afterward.

Bridget Brennan, the city’s special narcotics prosecutor, said that while the new laws were intended to reduce overdoses and reverse decades of harsh prison sentences for lower-level offenses, they also had the unintended effect of emboldening drug dealers. She said prosecutors will charge dealers three or four times and they still will not be held on bail.

“What that means in terms of drug dealers is they’re going to be more bold and blatant in their activity,” she said. “There is a lot of money to be made and there is not much of a deterrent.”

The number of narcotics arrests in the city closely tracks the policy shift. There were 27,232 narcotics arrests in 2018, according to police data. That dropped to 14,156 in 2020, at the height of the pandemic. Narcotics arrests for 2023 have risen, to 16,000 as of Sept. 17, a 34 percent increase from the same time period last year, but they remain well below 2018 levels.

Joseph Kenny, the chief of detectives for the Police Department, said that arresting a person for having a needle “was never a priority in our world.”

“We are not looking to take drug abusers and put them in prison,” Chief Kenny said. “We want them to get the help they need.”

As for arresting low-level street dealers, he said, prosecutors “are asking us to build bigger cases.”

“We need to target the dealers, the suppliers and the traffickers,” Chief Kenny said.

Civil rights concerns animate those opposed to more crackdowns; 94 percent of those prosecuted for narcotics charges in the Bronx were Black or Latino.

Supporters of decriminalization also point to data showing the lives saved at supervised consumption sites. Some express frustration that despite outward support by public officials, the centers still often exist in a legal purgatory, illegal under federal law, and lack government funding that might allow them to expand. Opponents say the sites encourage more drug use, especially on the blocks nearby.

Dr. Andrea Littleton provides medical care to drug users at the Kingsbridge underpass. She favors decriminalization, because she hopes that would lead to more regulation, and “hopefully get some of the fentanyl off the street.”

“At least then it would be less deadly to individuals, to babies and day cares,” she said.

One of the biggest concentrations of street drug activity in the city is near 125th Street and Lexington Ave in Harlem, near some drug treatment clinics. On one Wednesday in August, four officers stood on the southwest corner of the street, two of them looking at their phones.

When approached by a reporter, they said they had been told to stand there.

At one point, a different pair of police officers approached a busy corner for street drug activity — to give a summons to a man for public drinking.

Shawn Hill, co-founder of the Greater Harlem Coalition, a community group, has spent dozens of hours documenting open drug activity in the neighborhood in the hopes of reducing it. He seldom spots an arrest, he said.

“I think policing has changed dramatically in the last four or five years,” he said.

In a statement, the Police Department called the issue “a real concern to residents in all city neighborhoods.”

“There’s still work to be done, but our officers are more engaged and focused than ever,” the statement said.

Narcotics squads are still making thousands of arrests, including high-level drug busts, which are often undertaken with federal law enforcement. To underscore the risks of fentanyl after last week’s day care death, the city on Wednesday held a news conference to announce a large guns and drugs bust in Queens.

Ms. Brennan said that last year, her office seized about 1,000 pounds of fentanyl off the street and about one million pills. Police officials said an additional 150 officers have been added to narcotics squads recently, with plans for more. Each borough has two teams of officers assigned to investigate 911 and 311 calls about drug-related complaints. Additional units conduct “buy and bust” operations, where undercover officers make multiple drug buys from the same dealer in order to catch the more prolific street-level dealers.

“Our goal is that citizens shouldn’t have to walk past a drug dealer to get into their building,” Chief Kenny said.

Yet in some pockets, like Kingsbridge, several residents said it feels so unsafe that they have stopped leaving the house after 6 p.m. “You’ll come down the stairs with the kids in the morning and there’s someone sitting there, just shooting up,” said Chris Castellanos, 35, a father of four children.

Karla Cabrera Carrera, the district manager of Community Board 7, which includes Kingsbridge, said that society has to think about the rights of residents to feel secure, too.

“We are really facing a really bad issue,” she said. “I love the Bronx, I don’t want to have to move, but at this point, we are all desperate to find a solution to all of it.”
"""

I put together a template that we can use to send the article to GPT. The `{text}` part gets filled in with whatever we want - in this case it's the article, but we'll expand on this in other directions below.

In [26]:
template = """
Summarize the article below:

{text}
"""

prompt = template.format(text="This fills in the blank")
print(prompt)


Summarize the article below:

This fills in the blank



Now let's do all of the steps:

1. Connect to GPT
2. Create our prompt
3. Send it over

Note that if you get an error, make sure you've run the cell above that creates the article! You need to click it and press shift+enter, or press the "Run" button.

In [8]:
from langchain.chat_models import ChatOpenAI

# You'll need your own OpenAI GPT API key! This one was mine,
# but I've deactivated it so I can publish this.
API_KEY = "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"

llm = ChatOpenAI(openai_api_key=API_KEY, model_name="gpt-3.5-turbo")

In [17]:
template = """
Summarize the article below:

{text}
"""

prompt = template.format(text=article)
response = llm.predict(prompt)
print(response)

New York City's open-air drug markets, in areas such as the Bronx's Kingsbridge Road underpass, are becoming more of a problem, with police rarely intervening and overdoses at a record high. The city's special narcotics prosecutor’s office has reported that 3,200 deaths occurred in 2022 due to overdoses. The dilemma for the city is how to address the epidemic without resorting to aggressive crackdowns that have previously led to civil rights abuses and have not effectively curbed drug use. The situation is particularly concerning with the rise of fentanyl, a powerful synthetic opioid, which is becoming increasingly deadly. Over 1,500 children died from fentanyl overdoses in 2021, with over 100 of them being under the age of four. The city and state have been moving away from aggressive law enforcement of street drugs, instead viewing drug use as a public health crisis.



This gets us an answer, but *can we write a better prompt?*

Think about asking GPT to use a specific tone, or cheat a little by telling it to copy a publication. You can also set a length, reading level or target audience!

I also recommend asking your tools for suggestions – you can try asking ChatGPT "how would you describe the writing style of the New York Times? Be detailed." and use that as a jumping-off point.

> Don't ever feel like your prompt is too long! Check the [EditorBot prompt](https://jamditis.notion.site/Superprompt-EditorBot-for-feedback-on-your-writing-0268051a8f384762a2686b1d43ad390b) - it's approximately ten million words long! You can always refine it later if you'd like.
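
As one possible starting point, here's a sketch of a template with more blanks to fill in. The wording, the `build_summary_prompt` helper and its default values are all assumptions for illustration, so swap in whatever suits your publication.

```python
detailed_template = """
You are writing a blurb for a {publication_type}.
Summarize the article below in {sentence_count} sentences.
Use a {tone} tone and write for {audience}.

{text}
"""

def build_summary_prompt(text, publication_type="morning newsletter",
                         sentence_count=2, tone="neutral, factual",
                         audience="a general audience"):
    """Fill in every blank in the template and return the finished prompt."""
    return detailed_template.format(
        text=text,
        publication_type=publication_type,
        sentence_count=sentence_count,
        tone=tone,
        audience=audience,
    )

print(build_summary_prompt("Article text goes here", sentence_count=3))
```

The finished prompt can be sent with `llm.predict` exactly like the simpler one.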

In [98]:
template = """
Summarize the article below:

{text}
"""

prompt = template.format(text=article)
response = llm.predict(prompt)
print(response)

Open-air drug markets persist in New York City, even as the city grapples with a record high number of overdose deaths. The city's approach to drug use has shifted away from aggressive law enforcement towards treating it as a public health crisis, leading to a rise in visible drug use in areas ranging from low-income neighborhoods to tourist hubs. Critics argue this approach has led to an increase in violence and has made it harder to prosecute those higher up the drug chain. Meanwhile, the rise of fentanyl, a highly potent drug, has increased the stakes, with over 1,500 children dying from fatal overdoses in 2021. The city has responded by setting up supervised drug consumption sites, but these are often in a legal grey area and lack sufficient funding.


## Document summaries (research)

### Text file

In [100]:
# I'll add this later!
# You don't normally have text files, do you?

### PDF

PDFs are common sources for longer documents - reports, academic papers, etc.

I like using [pdfminer.six](https://pdfminersix.readthedocs.io/en/latest/) to extract text from PDFs because it's super easy. We'll first pull the text out, then try to send it to GPT for a summary.

In [24]:
from pdfminer.high_level import extract_text

pdf_text = extract_text("data-files/ai-report.pdf")
print(pdf_text[:1000])

POLIS

Journalism at LSE

Generating Change
A global survey of what news 
organisations are doing with AI 

Charlie Beckett and Mira Yaseen

Preface

Our news media world has been turned upside down again. As always, serious technological 
change produces both dystopian and utopian hype. Much of this has been generated on social 
media by corporate PR and politicians. News coverage and expert commentary has also 
veered from excited coverage of positive breakthroughs in fields such as medicine to much 
more frightening visions of negative forces unleashed: Generative AI (genAI) is producing a 
tidal wave of automated, undetectable disinformation; it will amplify discrimination, extreme 
speech and inequalities. 

And its impact on journalism? Again, much of the coverage has focused on the unreliability of 
many genAI tools and the controversy over its rapacious appetite for other people’s data to train 
its algorithms. As the initial storm of hype turns into more practical considerati

Now let's use a simple summarization prompt.

In [27]:
template = """
Summarize the paper below:

{text}
"""

prompt = template.format(text=pdf_text)
response = llm.predict(prompt)
print(response)

InvalidRequestError: This model's maximum context length is 8192 tokens. However, your messages resulted in 33095 tokens. Please reduce the length of the messages.

**We get an error!** This is because our text is longer than GPT's **token limit.**

To make this summarization work, we have to **split our document into pieces!**
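
To get a feel for what splitting means, here's a rough sketch in plain Python. It is not what LangChain does internally: `split_into_chunks` is a hypothetical helper that groups paragraphs by character count, and the `max_chars` default is an arbitrary assumption, not a real token limit.

```python
def split_into_chunks(text, max_chars=8000):
    """Group paragraphs into chunks that are at most max_chars long."""
    chunks = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if len(candidate) <= max_chars:
            # The paragraph still fits, so keep growing the current chunk
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single paragraph longer than max_chars is cut into fixed-size slices
            while len(paragraph) > max_chars:
                chunks.append(paragraph[:max_chars])
                paragraph = paragraph[max_chars:]
            current = paragraph
    if current:
        chunks.append(current)
    return chunks

sample = "a" * 10 + "\n\n" + "b" * 10 + "\n\n" + "c" * 25
pieces = split_into_chunks(sample, max_chars=22)
print(len(pieces), [len(piece) for piece in pieces])
```

Each piece is then small enough to be sent to the model on its own.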

> Instead of the AI report, I'm going to switch to a slightly shorter academic paper! The AI report gets broken into about 90 pieces, which takes a while to process. The science paper is only a *little* long, so it's just 21 segments.
>
> If you'd like to try the AI report (and wait a while in the process), feel free to change the filename.

We'll start by reading in our document and using `load_and_split` to divide it up into separate pages. Later on we'll show an example of more "intelligent" splitting that understands headers and sections, but this one is a simple example for now.

In [34]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader('data-files/science-paper.pdf')
docs = loader.load_and_split()

len(docs)

21

Then we'll connect to GPT.

In [35]:
from langchain.chat_models import ChatOpenAI

# You'll need your own OpenAI GPT API key! This one was mine,
# but I've deactivated it so I can publish this.
API_KEY = "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"

llm = ChatOpenAI(openai_api_key=API_KEY, model_name="gpt-3.5-turbo")

Finally, we'll create a prompt to summarize the pieces. It's a lot of code, but I promise you'll (almost???) always use the same one every time.

In [36]:
from langchain import PromptTemplate
from langchain.chains.summarize import load_summarize_chain

prompt_template = """Summarize the following document.


TEXT: {text}


CONCISE SUMMARY:"""

PROMPT = PromptTemplate(template=prompt_template, input_variables=["text"])
chain = load_summarize_chain(llm,
                             chain_type="map_reduce",
                             return_intermediate_steps=True,
                             map_prompt=PROMPT,
                             combine_prompt=PROMPT)

result = chain({"input_documents": docs}, return_only_outputs=True)

In [42]:
print(result['output_text'])

This document is a review of the nutrient composition and value of milk and plant-based milk alternatives. It compares the protein, fat, and calcium content of various plant-based milks to cow's milk. The review finds that plant-based milk alternatives often have lower nutritional value, particularly in terms of protein content. It also discusses other factors such as vitamin and mineral content and absorption. The document highlights the importance of understanding the implications of using plant-based milk alternatives as nutritional replacements. It mentions public health concerns related to the consumption of milk and provides references to studies on nutrition and health. Overall, the document concludes that plant-based milk alternatives are generally nutritionally inferior to cow's milk and should not be considered as complete nutritional alternatives.
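
If `map_reduce` feels opaque, the idea can be sketched without LangChain. Each piece gets summarized on its own (the "map" step), then the partial summaries are joined and summarized once more (the "reduce" step). `map_reduce_summarize` and `first_sentence` are hypothetical names, and `first_sentence` is only a stand-in for the LLM call so the example runs without an API key.

```python
def first_sentence(text):
    """Stand-in for an LLM call: keeps only the first sentence."""
    return text.split(".")[0].strip() + "."

def map_reduce_summarize(pieces, summarize):
    # Map: summarize every piece independently
    partial_summaries = [summarize(piece) for piece in pieces]
    # Reduce: join the partial summaries and summarize them one more time
    combined = "\n\n".join(partial_summaries)
    return summarize(combined)

pieces = [
    "Milk has protein. It also has fat.",
    "Oat milk has less protein. It is popular.",
]
print(map_reduce_summarize(pieces, first_sentence))
```

In the real chain, both steps send the `PROMPT` template to GPT instead of calling a toy function.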


## Translation

While we *could* use something like DeepL or Google Translate, why not just use the LLM for *absolutely everything?*

In [55]:
# Read the file as UTF-8 so the Hungarian accented characters load correctly
folktale = open("data-files/folktale-short.txt", encoding="utf-8").read()
print(folktale[:1000])

A KÓRÓ ÉS A KIS MADÁR.

Egyszer volt, hol nem volt, volt a világon egy kis madár. Ez a kis madár
egyszer nagyon megunta magát, rászállt egy kóróra:

– Kis kóró ringass engemet!

– Nem ringatom biz én senki kis madarát!

A kis madár megharagudott, elrepült onnan. A mint ment mendegélt, talált
egy kecskét:

– Kecske! rágd el a kórót!

Kecske nem ment kóró rágni, a kóró még se’ ringatta a kis madarat.
Megint ment mendegélt a kis madár, talált egy farkast:

– Farkas! edd meg a kecskét!

Farkas nem ment kecske enni, kecske nem ment kóró rágni, kóró még se’
ringatta a kis madarat.

Megint ment mendegélt a kis madár, talált egy falut:

– Falu! kergesd el a farkast!

Falu nem ment farkas kergetni, farkas nem ment kecske enni, kecske nem
ment kóró rágni, a kóró még se’ ringatta a kis madarat;

Megint ment mendegélt a kis madár, talált egy tüzet:

– Tűz! égesd meg a falut.

Tűz nem ment falu égetni, falu nem ment farkas kergetni, farkas nem ment
kecske enni, kecske nem ment kóró rágni, a kóró mé

As always, we'll start by connecting to GPT, then build a prompt and send it over to be translated.

In [57]:
from langchain.chat_models import ChatOpenAI

# You'll need your own OpenAI GPT API key! This one was mine,
# but I've deactivated it so I can publish this.
API_KEY = "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"

llm = ChatOpenAI(openai_api_key=API_KEY, model_name="gpt-3.5-turbo")

template = """
Translate the text below into English:

{text}
"""

prompt = template.format(text=folktale)
response = llm.predict(prompt)
print(response)

ONCE UPON A TIME THERE WAS A DISEASE AND A LITTLE BIRD.

Once upon a time, in a land far away, there was a little bird. This little bird got very bored and landed on a branch:

- Little branch, rock me to sleep!

- I don't rock any little birds!

The little bird got angry and flew away. As it wandered, it found a goat:

- Goat, chew on the branch!

The goat didn't chew on the branch, and the branch still didn't rock the little bird. The little bird continued on its way and found a wolf:

- Wolf, eat the goat!

The wolf didn't eat the goat, the goat didn't chew on the branch, and the branch still didn't rock the little bird.

The little bird continued on its way and found a village:

- Village, chase away the wolf!

The village didn't chase away the wolf, the wolf didn't eat the goat, the goat didn't chew on the branch, and the branch still didn't rock the little bird.

The little bird continued on its way and found a fire:

- Fire, burn down the village.

The fire didn't burn down the 

What happens if we try with `gpt-4` instead of `gpt-3.5-turbo`?

In [58]:
from langchain.chat_models import ChatOpenAI

# You'll need your own OpenAI GPT API key! This one was mine,
# but I've deactivated it so I can publish this.
API_KEY = "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"

llm = ChatOpenAI(openai_api_key=API_KEY, model_name="gpt-4")

template = """
Translate the text below into English:

{text}
"""

prompt = template.format(text=folktale)
response = llm.predict(prompt)
print(response)

## Transcription (speech to text)

[Whisper](https://github.com/openai/whisper) is a transcription engine. You can use the code below to read from an mp3, or you can [try the interactive version](https://huggingface.co/spaces/openai/whisper).

In [61]:
import whisper

model = whisper.load_model("base")
result = model.transcribe("sample-audio.mp3")



In [62]:
print(result["text"][:1000])

 Okay, I was wondering if you could tell me a little bit about the program. What exactly goes on? Okay, well basically it's an intensive program designed to last four weeks, each session is four weeks long, although we do have two sessions in 1996 that will be two weeks long. But our general four weeks session involves seven hours of contact each day of Spanish. It involves starting at nine o'clock in the morning with three hours of grammar instruction. And then it's followed immediately by an hour of group conversation with the same instructor, same class, but we have extensive gardens here at the Institute and so students will usually go out to a garden setting because it's a little more comfortable. And then after a break for lunch, students return and we have a complete flip of instruction in that instead of textbook and clinical type instruction, we have hands on and we offer workshops in Mejibak's drop leaving, regional cooking of Wahaka at Donpas-Nal pottery. And if we have larg

In the example above, we use the `base` model. You can see [the different model versions here](https://github.com/openai/whisper#available-models-and-languages), which vary in their accuracy and knowledge of non-English languages.
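
The result from `transcribe` also contains a `segments` list with start and end times in seconds, which is handy for finding a quote in the original audio. Below is a small sketch for turning segments into timestamped lines. `format_timestamp` and `segments_to_lines` are hypothetical helpers, and the sample segments use invented timings so the code runs without Whisper installed.

```python
def format_timestamp(seconds):
    """Convert a number of seconds into MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"

def segments_to_lines(segments):
    """Prefix each segment's text with the time at which it starts."""
    return [
        f"[{format_timestamp(segment['start'])}] {segment['text'].strip()}"
        for segment in segments
    ]

# Invented example segments, shaped like the entries in result["segments"]
sample_segments = [
    {"start": 0.0, "end": 4.2, "text": " Okay, I was wondering if you could tell me about the program."},
    {"start": 65.5, "end": 70.0, "text": " What exactly goes on?"},
]

for line in segments_to_lines(sample_segments):
    print(line)
```

With a real transcription you would call `segments_to_lines(result["segments"])`.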

## Reading text out loud (text to speech)

Browse the [text to speech models on Hugging Face](https://huggingface.co/models?pipeline_tag=text-to-speech&sort=trending). They... are mostly not that good. Usually.

### Voice cloning

I'm just going to send you to [ElevenLabs](https://elevenlabs.io/). I showed you an example in class, but I don't have an interactive one here!

## Text classification

### Zero-shot classification

Because large language models are remarkably well-read, they're very good at categorizing content for you. If your categories are nuanced and specific it's a little more trouble, but if you're sorting content into *broad categories* they should do a great job.

Here's a basic example of how it works:

In [None]:
prompt = """
Categorize the following text as being about ENVIRONMENT, GUN CONTROL,
or IMMIGRATION. Respond with only the category.

Text: A Bill to Regulate the Sulfur Emissions of Coal-Fired Energy
Plants in the State of New York.
"""

response = llm.predict(prompt)
print(response)

Instead of writing the big long `"Categorize the following text..."` prompt for *every single bill*, we can also use Python's `.format`. It allows us to write the prompt once, then fill in the blanks for the part that changes each time.

In [None]:
template = """
Categorize the following text as being about ENVIRONMENT, GUN CONTROL,
or IMMIGRATION. Respond with only the category.

Text: {bill_text}
"""

print(template.format(bill_text="This fills in the spot in the template"))

This is super convenient if we want to categorize many things! For example, you might use a loop...

In [None]:
template = """
Categorize the following text as being about ENVIRONMENT, GUN CONTROL,
or IMMIGRATION. Respond with only the category.

Text: {bill_text}
"""

bills = [
    "A Bill to Allow Additional Refugees In Upstate New York",
    "A Bill to Close Down Coal-fired Power Plants",
    "A Bill to Ban Assault Rifles at Public Events"
]

for bill in bills:
    prompt = template.format(bill_text=bill)
    response = llm.predict(prompt)
    print(bill, "is", response)

...or if you have a pandas DataFrame, you can use `.apply` and build a new column for your category!

In [5]:
import pandas as pd

pd.set_option("display.max_colwidth", 500)

df = pd.DataFrame({
    'title': [
        "A Bill to Allow Additional Refugees In Upstate New York",
        "A Bill to Close Down Coal-fired Power Plants",
        "A Bill to Ban Assault Rifles at Public Events"
    ]
})
df

|   | title |
|---|-------|
| 0 | A Bill to Allow Additional Refugees In Upstate New York |
| 1 | A Bill to Close Down Coal-fired Power Plants |
| 2 | A Bill to Ban Assault Rifles at Public Events |


In [None]:
template = """
Categorize the following text as being about ENVIRONMENT, GUN CONTROL,
or IMMIGRATION. Respond with only the category.

Text: {bill_text}
"""

def make_prediction(row):
    text = row['title']
    prompt = template.format(bill_text=text)
    response = llm.predict(prompt)
    return response

# Send each row to make_prediction and store the category in a new column
df['predicted_category'] = df.apply(make_prediction, axis=1)

It's like magic!
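
One caveat: models don't always follow "Respond with only the category" to the letter, so it can be worth tidying up the response before trusting it. This is a minimal sketch under that assumption. `normalize_category` is a hypothetical helper, and returning `"UNKNOWN"` for anything ambiguous is just one possible choice.

```python
CATEGORIES = ["ENVIRONMENT", "GUN CONTROL", "IMMIGRATION"]

def normalize_category(response, categories=CATEGORIES):
    """Map a free-text model response onto one of the allowed categories."""
    # Drop surrounding whitespace, then stray punctuation and quotes, and ignore case
    cleaned = response.strip().strip(".\"'").upper()
    if cleaned in categories:
        return cleaned
    # Fall back to looking for a category name inside a longer answer
    matches = [category for category in categories if category in cleaned]
    return matches[0] if len(matches) == 1 else "UNKNOWN"

print(normalize_category(" Environment.\n"))
print(normalize_category("Category: gun control"))
print(normalize_category("I'm not sure"))
```

Inside `make_prediction` you could return `normalize_category(response)` instead of the raw response, then check the `"UNKNOWN"` rows by hand.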

### GPT in Google Sheets

You can also do this with GPT [in Google Sheets](https://www.makeuseof.com/how-use-chatgpt-google-sheets/) instead of relying on code.

### Few-shot classification

Sometimes your classifier might not do a great job. It might be a topic the LLM doesn't know much about, you might be asking for nuance it doesn't have out of the box, or a million and one other things might be going wrong. In those cases, it's useful to try **few-shot classification**.

In the same way that zero-shot classification takes zero examples going in, few-shot classification takes... a few examples!

In [None]:
template = """
Categorize the following text as being about ENVIRONMENT, GUN CONTROL,
or IMMIGRATION. Respond with only the category.

Text: Something about immigration
IMMIGRATION

Text: Something about gun control
GUN CONTROL

Text: {bill_text}
"""

print(template.format(bill_text="This fills in the spot in the template"))
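
If your examples live in a list (or a spreadsheet), you can also build the few-shot prompt in code instead of typing it out. Here's a sketch of that idea, where `build_few_shot_prompt` is a hypothetical helper and the two labeled examples are placeholders you'd replace with your own.

```python
INSTRUCTIONS = (
    "Categorize the following text as being about ENVIRONMENT, GUN CONTROL,\n"
    "or IMMIGRATION. Respond with only the category."
)

def build_few_shot_prompt(examples, bill_text):
    """Build a prompt from (text, label) pairs followed by the new text."""
    shots = [f"Text: {text}\n{label}" for text, label in examples]
    parts = [INSTRUCTIONS] + shots + [f"Text: {bill_text}"]
    return "\n\n".join(parts) + "\n"

# Placeholder examples: swap in real bills that you've labeled yourself
examples = [
    ("A Bill to Allow Additional Refugees In Upstate New York", "IMMIGRATION"),
    ("A Bill to Close Down Coal-fired Power Plants", "ENVIRONMENT"),
]

print(build_few_shot_prompt(examples, "This fills in the spot in the template"))
```

Adding or removing examples then only means editing the list.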

### Custom classifiers

You can also use [Hugging Face AutoTrain](https://ui.autotrain.huggingface.co/) to train your own classifiers. Instead of trusting that the model has enough information about the world to make a decision, you give it exactly what it needs to know.

Custom classifiers aren't necessarily worse or better: they're just more work on your end because you have to manually label examples of each category to show the model what you mean. Instead of "few shot" it's "a whole lot of shots" - probably hundreds!

## Format shifting

A lot of the power of these models comes from format shifting: changing content from one format to another, whether that's audio, text, images or video.

* **Video to image:** then analyze the images
* **Audio to text:** transcription
* **Text to image:** I think this one is immoral. Art is resistant to errors, so it's much easier to just rip off someone's style. It's also trained on people's work in a way that's far more obviously inappropriate than text models.
* **Image to text:** accessibility and research

If you have a lot of content in one category, it's worth thinking about what you can do to convert it into another category and re-use it.

## Image models

The three major things you can do with images are:

1. Categorize your images
2. Detect objects in the images
3. Separate your image into pieces (pixel counting)

If you use [HuggingFace AutoTrain](https://ui.autotrain.huggingface.co/), you can easily make your own models! I made one [to classify illegal amber mines](https://huggingface.co/wendys-llc/amber-mines) based on [Texty's work](https://texty.org.ua/d/2018/amber_eng/).

In [64]:
from transformers import pipeline

classifier = pipeline("image-classification", model="wendys-llc/amber-mines")

What does the model think about this image?

![](amber-sample-1.png)

In [65]:
classifier("amber-sample-1.png")

[{'score': 0.9965369701385498, 'label': 'negative'},
 {'score': 0.003463044995442033, 'label': 'positive'}]

How about this one?

![](amber-sample-2.png)

In [66]:
classifier("amber-sample-2.png")

[{'score': 0.9864670634269714, 'label': 'positive'},
 {'score': 0.013532912358641624, 'label': 'negative'}]
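The classifier gives back a list of labels with scores, so you usually want a small helper to turn that into a single decision. This is just a sketch: `top_label` is a made-up name and the 0.9 cutoff is an assumption you would tune for your own project.

```python
# Output copied from the second classifier call above.
predictions = [
    {"score": 0.9864670634269714, "label": "positive"},
    {"score": 0.013532912358641624, "label": "negative"},
]


def top_label(predictions, threshold=0.9):
    """Return the best label, or "unsure" if its score is below the threshold."""
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= threshold else "unsure"


print(top_label(predictions))
```

Anything that comes back as `"unsure"` is a good candidate for a human to look at.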

### Object detection

[Here is an example of object detection](https://huggingface.co/spaces/wendys-llc/OWL-ViT)

You can combine this with a large language model to ask questions of images. "Are there cats in this image?" It's like a more flexible classifier – but with limitations about what the model knows about.
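Even without a language model in the loop, "are there cats in this image?" can be reduced to counting detections. The sketch below assumes the detector returns a list of dictionaries with `score`, `label`, and `box` keys (the shape the `transformers` object detection pipelines use). The detections themselves are invented for illustration, and `count_label` is a hypothetical helper.

```python
# Hypothetical detections, made up for illustration.
detections = [
    {"score": 0.92, "label": "cat",
     "box": {"xmin": 10, "ymin": 20, "xmax": 110, "ymax": 140}},
    {"score": 0.35, "label": "cat",
     "box": {"xmin": 200, "ymin": 40, "xmax": 260, "ymax": 90}},
    {"score": 0.88, "label": "dog",
     "box": {"xmin": 300, "ymin": 60, "xmax": 420, "ymax": 200}},
]


def count_label(detections, label, min_score=0.5):
    """Count detections of one label, ignoring low-confidence ones."""
    return sum(
        1 for d in detections
        if d["label"] == label and d["score"] >= min_score
    )


cats = count_label(detections, "cat")
print(f"Cats found: {cats}")
```

The `min_score` cutoff is an assumption: lower it and you catch more cats, but also more things that aren't cats.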

### Semantic segmentation

[Here is an example of semantic segmentation](https://huggingface.co/spaces/thiagohersan/maskformer-satellite-trees-gradio)

How much of this image is pavement? Or grass? Or trees? Or cars?

Not too useful with normal imagery, but very useful for satellite or aerial imagery.
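Once you have a segmentation mask, "how much of this image is pavement?" is just pixel counting. Here is a minimal sketch, assuming the model gives you a grid with one class ID per pixel. The tiny mask and the class names are hypothetical; a real mask would be as large as the image.

```python
from collections import Counter

# Hypothetical class IDs and a tiny 4x6 mask, made up for illustration.
CLASS_NAMES = {0: "pavement", 1: "grass", 2: "trees"}
mask = [
    [0, 0, 0, 1, 1, 2],
    [0, 0, 1, 1, 2, 2],
    [0, 0, 1, 1, 2, 2],
    [0, 0, 0, 1, 1, 2],
]


def class_shares(mask, class_names):
    """Return the fraction of pixels that belong to each class."""
    counts = Counter(pixel for row in mask for pixel in row)
    total = sum(counts.values())
    return {class_names[c]: n / total for c, n in counts.items()}


shares = class_shares(mask, CLASS_NAMES)
for name, share in sorted(shares.items()):
    print(f"{name}: {share:.0%}")
```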

### Panoptic segmentation

[Here is an example of panoptic segmentation](https://huggingface.co/spaces/wendys-llc/panoptic-segment-anything)

Panoptic segmentation does both object detection *and* semantic segmentation.

## Semantic search

A combination of semantic search and keyword matching usually outperforms semantic search by itself.

Transcribing all of your interviews and then doing searches across them is a beautiful thing.

Give [Semantra](https://github.com/freedmand/semantra) a try to run it locally.
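One common way to combine semantic search with keyword matching is reciprocal rank fusion: each document gets `1 / (k + rank)` points from every result list it appears in, and you sort by the total. This is a sketch of that one technique, not necessarily what any particular tool does. The interview IDs are hypothetical, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs; highest fused score comes first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical result lists, best match first.
semantic_hits = ["interview_07", "interview_02", "interview_11"]
keyword_hits = ["interview_02", "interview_05", "interview_07"]

fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)
```

Documents that show up in both lists float to the top, even if neither search ranked them first.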

In [None]:
!pip install -q sentence-transformers sentencepiece

In [70]:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence"]

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
print(embeddings[0][:50])

[ 0.0676569   0.06349581  0.0487131   0.07930496  0.03744796  0.00265277
  0.03937485 -0.00709837  0.0593615   0.03153696  0.06009803 -0.05290522
  0.04060676 -0.02593078  0.02984274  0.00112689  0.07351495 -0.05038185
 -0.12238666  0.02370274  0.02972649  0.04247681  0.0256338   0.00199517
 -0.05691912 -0.02715985 -0.03290359  0.06602488  0.11900704 -0.04587924
 -0.07262138 -0.03258408  0.05234135  0.04505523  0.00825302  0.03670237
 -0.01394151  0.06539196 -0.02642729  0.00020634 -0.01366437 -0.03628108
 -0.0195043  -0.02897387  0.03942709 -0.08840913  0.00262434  0.01367143
  0.04830637 -0.03115652]
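Those numbers are the sentence's position in "meaning space," and two sentences are compared by measuring the angle between their lists of numbers: cosine similarity. Here is a hand-rolled sketch of that calculation. The three-number vectors are toy values made up for illustration (real embeddings have hundreds of dimensions), and `cosine` is just a helper name.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings", made up for illustration.
fish = [0.9, 0.1, 0.0]
carp = [0.8, 0.2, 0.1]
house = [0.0, 0.2, 0.9]

print("fish vs carp: ", round(cosine(fish, carp), 3))
print("fish vs house:", round(cosine(fish, house), 3))
```

Vectors pointing in nearly the same direction score close to 1, and unrelated ones score close to 0.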


In [71]:
import pandas as pd

sentences = [
    "Molly ate a fish",
    "Jen consumed a carp",
    "I would like to sell you a house",
    "Я пытаюсь купить дачу", # I'm trying to buy a summer home
    "J'aimerais vous louer un grand appartement", # I would like to rent a large apartment to you
    "This is a wonderful investment opportunity",
    "Это прекрасная возможность для инвестиций", # investment opportunity
    "C'est une merveilleuse opportunité d'investissement", # investment opportunity
    "これは素晴らしい投資機会です", # investment opportunity
    "野球はあなたが思うよりも面白いことがあります", # baseball can be more interesting than you think
    "Baseball can be interesting than you'd think"
]

In [72]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)

In [73]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute similarities exactly the same as we did before!
similarities = cosine_similarity(embeddings)

# Turn into a dataframe
pd.DataFrame(similarities,
            index=sentences,
            columns=sentences) \
            .style \
            .background_gradient(axis=None)

| | Molly ate a fish | Jen consumed a carp | I would like to sell you a house | Я пытаюсь купить дачу | J'aimerais vous louer un grand appartement | This is a wonderful investment opportunity | Это прекрасная возможность для инвестиций | C'est une merveilleuse opportunité d'investissement | これは素晴らしい投資機会です | 野球はあなたが思うよりも面白いことがあります | Baseball can be interesting than you'd think |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Molly ate a fish | 1.0 | 0.526053 | 0.025476 | 0.098335 | 0.020435 | -0.065293 | 0.035801 | -0.062506 | 0.027358 | 0.017622 | 0.023445 |
| Jen consumed a carp | 0.526053 | 1.0 | 0.044178 | 0.035044 | -0.018194 | -0.004438 | -0.078566 | -0.011418 | 0.090357 | 0.131507 | 0.0161 |
| I would like to sell you a house | 0.025476 | 0.044178 | 1.0 | 0.154773 | 0.083555 | 0.386736 | 0.017175 | -0.006744 | 0.010857 | 0.02551 | 0.006353 |
| Я пытаюсь купить дачу | 0.098335 | 0.035044 | 0.154773 | 1.0 | 0.159519 | 0.064379 | 0.462397 | 0.09211 | 0.314708 | 0.327675 | -0.119607 |
| J'aimerais vous louer un grand appartement | 0.020435 | -0.018194 | 0.083555 | 0.159519 | 1.0 | 0.032253 | 0.365505 | 0.566635 | 0.172406 | 0.110118 | -0.013743 |
| This is a wonderful investment opportunity | -0.065293 | -0.004438 | 0.386736 | 0.064379 | 0.032253 | 1.0 | -0.030322 | 0.21223 | 0.023889 | -0.002844 | 0.112804 |
| Это прекрасная возможность для инвестиций | 0.035801 | -0.078566 | 0.017175 | 0.462397 | 0.365505 | -0.030322 | 1.0 | 0.282414 | 0.267571 | 0.285873 | -0.040309 |
| C'est une merveilleuse opportunité d'investissement | -0.062506 | -0.011418 | -0.006744 | 0.09211 | 0.566635 | 0.21223 | 0.282414 | 1.0 | 0.292651 | 0.187989 | 0.006793 |
| これは素晴らしい投資機会です | 0.027358 | 0.090357 | 0.010857 | 0.314708 | 0.172406 | 0.023889 | 0.267571 | 0.292651 | 1.0 | 0.577265 | -0.10063 |
| 野球はあなたが思うよりも面白いことがあります | 0.017622 | 0.131507 | 0.02551 | 0.327675 | 0.110118 | -0.002844 | 0.285873 | 0.187989 | 0.577265 | 1.0 | -0.098722 |
| Baseball can be interesting than you'd think | 0.023445 | 0.0161 | 0.006353 | -0.119607 | -0.013743 | 0.112804 | -0.040309 | 0.006793 | -0.10063 | -0.098722 | 1.0 |


In [74]:
model = SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v2')
embeddings = model.encode(sentences)

In [75]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute similarities exactly the same as we did before!
similarities = cosine_similarity(embeddings)

# Turn into a dataframe
pd.DataFrame(similarities,
            index=sentences,
            columns=sentences) \
            .style \
            .background_gradient(axis=None)

| | Molly ate a fish | Jen consumed a carp | I would like to sell you a house | Я пытаюсь купить дачу | J'aimerais vous louer un grand appartement | This is a wonderful investment opportunity | Это прекрасная возможность для инвестиций | C'est une merveilleuse opportunité d'investissement | これは素晴らしい投資機会です | 野球はあなたが思うよりも面白いことがあります | Baseball can be interesting than you'd think |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Molly ate a fish | 1.0 | 0.358347 | 0.05834 | 0.145439 | -0.024103 | -0.070145 | -0.075333 | -0.073496 | -0.111467 | -0.025614 | 0.001549 |
| Jen consumed a carp | 0.358347 | 1.0 | 0.059195 | 0.190241 | -0.001941 | -0.024359 | -0.024816 | -0.023295 | -0.087019 | 0.040799 | 0.067243 |
| I would like to sell you a house | 0.05834 | 0.059195 | 1.0 | 0.418692 | 0.642746 | 0.081795 | 0.118611 | 0.067805 | 0.04256 | 0.144491 | 0.1393 |
| Я пытаюсь купить дачу | 0.145439 | 0.190241 | 0.418692 | 1.0 | 0.351605 | 0.120679 | 0.184644 | 0.144633 | 0.115598 | 0.050505 | 0.046084 |
| J'aimerais vous louer un grand appartement | -0.024103 | -0.001941 | 0.642746 | 0.351605 | 1.0 | 0.203307 | 0.238716 | 0.204762 | 0.195163 | 0.201317 | 0.151998 |
| This is a wonderful investment opportunity | -0.070145 | -0.024359 | 0.081795 | 0.120679 | 0.203307 | 1.0 | 0.953561 | 0.964282 | 0.945246 | 0.062618 | 0.13322 |
| Это прекрасная возможность для инвестиций | -0.075333 | -0.024816 | 0.118611 | 0.184644 | 0.238716 | 0.953561 | 1.0 | 0.968368 | 0.944719 | 0.084221 | 0.136699 |
| C'est une merveilleuse opportunité d'investissement | -0.073496 | -0.023295 | 0.067805 | 0.144633 | 0.204762 | 0.964282 | 0.968368 | 1.0 | 0.959357 | 0.086458 | 0.146568 |
| これは素晴らしい投資機会です | -0.111467 | -0.087019 | 0.04256 | 0.115598 | 0.195163 | 0.945246 | 0.944719 | 0.959357 | 1.0 | 0.091451 | 0.115392 |
| 野球はあなたが思うよりも面白いことがあります | -0.025614 | 0.040799 | 0.144491 | 0.050505 | 0.201317 | 0.062618 | 0.084221 | 0.086458 | 0.091451 | 1.0 | 0.839617 |
| Baseball can be interesting than you'd think | 0.001549 | 0.067243 | 0.1393 | 0.046084 | 0.151998 | 0.13322 | 0.136699 | 0.146568 | 0.115392 | 0.839617 | 1.0 |
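A big grid of numbers is hard to read, so it can help to pull out each sentence's closest match. Here is a small sketch using four of the sentences and their scores from the multilingual model's table above. `nearest_neighbors` is a made-up helper name.

```python
labels = [
    "This is a wonderful investment opportunity",
    "Это прекрасная возможность для инвестиций",
    "野球はあなたが思うよりも面白いことがあります",
    "Baseball can be interesting than you'd think",
]

# Values copied from the multilingual model's similarity table above.
similarities = [
    [1.0,      0.953561, 0.062618, 0.13322],
    [0.953561, 1.0,      0.084221, 0.136699],
    [0.062618, 0.084221, 1.0,      0.839617],
    [0.13322,  0.136699, 0.839617, 1.0],
]


def nearest_neighbors(similarities):
    """For each row, return the index of the most similar *other* row."""
    result = []
    for i, row in enumerate(similarities):
        # Skip the diagonal: every sentence matches itself perfectly.
        candidates = [(score, j) for j, score in enumerate(row) if j != i]
        result.append(max(candidates)[1])
    return result


neighbors = nearest_neighbors(similarities)
for i, j in enumerate(neighbors):
    print(f"{labels[i]!r} -> {labels[j]!r} ({similarities[i][j]:.2f})")
```

With the multilingual model, each sentence pairs up with its translation, even across alphabets.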


### Searching through documents with semantic search

What does this allow you to do?

In [77]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader('ai-report.pdf')
docs = loader.load_and_split()

len(docs)

90

In [78]:
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name='paraphrase-multilingual-MiniLM-L12-v2')
db = Chroma.from_documents(docs, embeddings)

Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.


In [83]:
db.similarity_search("losing jobs", k=4)

[Document(lc_kwargs={'page_content': '7\n12 Cultural resistance and fears of job displacement and scepticism of AI technologies \ncannot be discounted. \n13 Across the board, respondents noted that mitigating AI integration challenges requires \nbridging knowledge gaps among various teams in the newsroom. Similarly, cross-\ndepartment collaboration was seen as necessary for achieving effective AI adoption.\n14 The challenge of keeping pace with the rapid evolution of AI was consistently \nmentioned throughout the survey.\n15 About 40% of respondents said their approach to AI has not changed over the past \nfew years, either because they are still in the beginning of their AI journey or because \nAI integration remains limited in their newsrooms. Concurrently, around a 1/4 said \ntheir organisation’s approach to AI has evolved; they have gained hands-on experience \nthat helps them think more realistically about AI.\n16 More than 60% of respondents are concerned about the ethical implic

### Specifying details

Let's get a little more specific!

In [88]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = TextLoader('hungarian-folktales.txt')
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = text_splitter.split_documents(documents)
len(docs)

453
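To get a feel for what `chunk_size` and `chunk_overlap` mean, here is a simplified stand-in for the splitter. It is only a sketch: the real `RecursiveCharacterTextSplitter` also tries to break at paragraph and sentence boundaries, while this made-up `chunk_text` function just slides a fixed-size window across the text.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new chunk starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Tiny sizes so the overlap is easy to see.
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

The overlap means a sentence that straddles a boundary still shows up whole in at least one chunk, which helps the search find it.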

In [89]:
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name='paraphrase-multilingual-MiniLM-L12-v2')
db = Chroma.from_documents(docs, embeddings)

Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.
Exiting: Cleaning up .chroma directory


In [97]:
matches = db.similarity_search("weddings at a festival with loud music", k=1)

for match in matches:
    print(match.page_content)
    print("-----")

Eltelt az egy hónap, elérkezett az esküvő napja, ott volt a sok vendég,
köztök a boltos is, csak a vőlegényt meg a menyasszonyt nem lehetett
látni. Bekövetkezett az ebéd ideje is, mindnyájan vígan ültek le az
asztalhoz, elkezdtek enni. Az volt a szokás a gróf házánál, hogy minden
embernek egy kis külön tálban vitték az ételt; a boltos amint a maga
táljából szedett levest, hát csak alig tudta megenni, olyan sótalan
volt, nézett körül só után, de nem volt az egész asztalon; a második
étel még sótalanabb volt, a harmadik meg már olyan volt, hogy hozzá se’
tudott nyúlni. Kérdezték tőle hogy mért nem eszik? tán valami baja van
az ételnek? amint ott vallatták, eszébe jutott a lyánya, hogy az neki
azt mondta, hogy úgy szereti, mint a sót, elkezdett sírni; kérdezték
aztán tőle, hogy mért sír, akkor elbeszélt mindent, hogy volt neki egy
lyánya, az egyszer neki azt mondta, hogy úgy szereti mint a sót, ő
megharagudott érte, elkergette a házától, lám most látja, hogy milyen
-----


## Retrieval-augmented generation (smart chatbots)

Find relevant passages, send them to your chatbot along with your question.
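The "send them along with your question" step is mostly string formatting. Here is a minimal sketch: `build_rag_prompt` is a made-up helper, the prompt wording is just one reasonable choice, and the passages are abridged from the report excerpt returned by the search earlier.

```python
def build_rag_prompt(question, passages):
    """Combine retrieved passages and a question into one chatbot prompt."""
    # Number the passages so the answer can point back to them.
    numbered = [f"[{i}] {p.strip()}" for i, p in enumerate(passages, start=1)]
    context = "\n\n".join(numbered)
    return (
        "Answer the question using only the passages below. "
        "If the passages do not contain the answer, say so.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Passages abridged from the earlier search result; the question is hypothetical.
passages = [
    "Cultural resistance and fears of job displacement cannot be discounted.",
    "The challenge of keeping pace with the rapid evolution of AI was consistently mentioned.",
]
prompt = build_rag_prompt("What worries newsrooms about AI?", passages)
print(prompt)
```

In a real pipeline the passages would come from something like `db.similarity_search(question, k=4)`, and the finished prompt would go to the model with `llm.predict(prompt)`.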

::: {.callout-note appearance="simple"}
You can improve performance by having the model generate potential answers, then also retrieving passages that are similar to those potential answers.
:::

There are a lot of places things can go wrong: documents segmented poorly or missing context, unhelpful embeddings, or the question being answered incorrectly. It's probably best if you just use semantic search to get results and read them yourself!

## Contact me

Feel free to reach out – I'm more than happy to provide ideas, guidance, lectures, etc etc etc. You can find me via email at [js4571@columbia.edu](mailto:js4571@columbia.edu) or on Twitter at [@dangerscarf](https://twitter.com/dangerscarf). I also run two data journalism programs at Columbia: the 12-month [Data Journalism MS](https://journalism.columbia.edu/ms-data-journalism) and the [Lede Program](https://ledeprogram.com/), a summer intensive.