## Installing prerequisites

In [None]:
!pip install -q --upgrade torch transformers \
    sentence-transformers sentencepiece \
    protobuf==3.20 pystemmer eli5 \
    openai-whisper scikit-learn chromadb \
    openai langchain==0.0.198

In [124]:
from IPython.display import HTML, display

def set_css():
    display(HTML('''
    <style>
    pre {
        white-space: pre-wrap;
    }
    </style>
    '''))
get_ipython().events.register('pre_run_cell', set_css)

## Downloading our data

In [None]:
!wget https://raw.githubusercontent.com/jsoma/2023-journalismai/main/book.txt
!wget https://raw.githubusercontent.com/jsoma/2023-journalismai/main/folktale.txt
!wget https://raw.githubusercontent.com/jsoma/2023-journalismai/main/wapo-reviews-marked.csv
!wget https://raw.githubusercontent.com/jsoma/2023-journalismai/main/nytimes-story.txt
!wget https://raw.githubusercontent.com/jsoma/2023-journalismai/main/6313.mp3

## Sentiment analysis

Sentiment analysis judges whether a text is **positive** or **negative**.

In [None]:
from transformers import pipeline

sentiment_pipeline = pipeline("sentiment-analysis")
data = ["I love sandwiches"]
sentiment_pipeline(data)

Oh, it looks like we should [specify a model](https://huggingface.co/models)! Let's explicitly ask for the default one.

In [None]:
sentiment_pipeline = pipeline("sentiment-analysis",
                             model="distilbert-base-uncased-finetuned-sst-2-english")
data = ["I love sandwiches"]
sentiment_pipeline(data)

In [None]:
sentiment_pipeline = pipeline("sentiment-analysis",
                             model="distilbert-base-uncased-finetuned-sst-2-english")
data = ["j'adore les sandwichs"]
sentiment_pipeline(data)

In [None]:
sentiment_pipeline = pipeline("sentiment-analysis",
                             model="distilbert-base-uncased-finetuned-sst-2-english")
data = ["я люблю бутерброды"]
sentiment_pipeline(data)

If we want to try another one, we can look at [the most popular ones](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads&search=sentiment).

In [None]:
sentiment_pipeline = pipeline("sentiment-analysis",
                             model="cardiffnlp/twitter-xlm-roberta-base-sentiment")
data = ["я люблю бутерброды"]
sentiment_pipeline(data)
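
Each call gives back a list with one dictionary per input, holding a `label` and a `score`. Label names differ between models, so a small helper can make results easier to compare. This is only a sketch: the helper `signed_score` and the sample outputs are made up for illustration, not copied from a real run.

```python
def signed_score(prediction):
    """Map a {'label', 'score'} dict to a number between -1 and 1."""
    label = prediction["label"].lower()
    if label.startswith("pos"):
        return prediction["score"]
    if label.startswith("neg"):
        return -prediction["score"]
    # Anything else (for example a "neutral" label) counts as zero
    return 0.0

# Made-up outputs shaped like what a sentiment pipeline returns
sample_outputs = [
    {"label": "POSITIVE", "score": 0.9},
    {"label": "negative", "score": 0.75},
    {"label": "neutral", "score": 0.6},
]

signed = [signed_score(p) for p in sample_outputs]
print(signed)
```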

## Classification

**Classification** is a classic problem in investigative journalism.

You have a lot of documents: how do you find the ones you're interested in?

- Atlanta Journal-Constitution: [Doctors & Sex Abuse: Still forgiven](https://doctors.ajc.com/)
- Washington Post: [Apple says its App Store is ‘a safe and trusted place.’ We found 1,500 reports of unwanted sexual behavior on six apps, some targeting minors.](https://www.washingtonpost.com/technology/2019/11/22/apple-says-its-app-store-is-safe-trusted-place-we-found-reports-unwanted-sexual-behavior-six-apps-some-targeting-minors/)

### The old approach

Historically, you labeled a subset, then used a **machine learning algorithm** that scored the rest of them.

In [None]:
import pandas as pd
pd.set_option("display.max_colwidth", 300)

df = pd.read_csv("wapo-reviews-marked.csv")
df.head()

In [None]:
known = df[df.sexual.notna()].copy()
unknown = df[df.sexual.isna()].copy()

In [None]:
known.head()

In [None]:
%%time

from sklearn.feature_extraction.text import TfidfVectorizer
import Stemmer

stemmer = Stemmer.Stemmer('en')

class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # Start from the standard analyzer, then stem every token it produces
        analyzer = super().build_analyzer()
        return lambda doc: stemmer.stemWords(analyzer(doc))

vectorizer = StemmedTfidfVectorizer(max_features=500, max_df=0.30)
matrix = vectorizer.fit_transform(known.Review)

words_df = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())
words_df.head(5)
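
To see roughly what those numbers mean, here is a hand-rolled sketch of the weighting `TfidfVectorizer` applies with its default settings (smoothed IDF, then scaling each row to length 1). It leaves out stemming, `max_df` and `max_features`, splits on whitespace only, and uses made-up toy documents.

```python
import math
from collections import Counter

# Made-up toy documents, already lowercased
toy_docs = ["great app great people", "creepy people", "great chat app"]
tokenized = [doc.split() for doc in toy_docs]

n_docs = len(tokenized)
# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    counts = Counter(tokens)
    # Smoothed inverse document frequency: rarer words get larger weights
    raw = {w: c * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
           for w, c in counts.items()}
    # Scale the document vector to length 1
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

weights = [tfidf(tokens) for tokens in tokenized]
print(weights[1])
```

In the second toy document, "creepy" outweighs "people" because it shows up in fewer documents.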

In [None]:
from sklearn.svm import LinearSVC

X = matrix
y = known.sexual

clf = LinearSVC(class_weight='balanced')
clf.fit(X, y)

In [None]:
X = vectorizer.transform(unknown.Review)

unknown['predicted'] = clf.predict(X)
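# decision_function is a signed distance from the decision boundary:
# useful for ranking, but not a true probability
unknown['predicted_proba'] = clf.decision_function(X)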
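
Since `decision_function` gives margins rather than probabilities, one option for showing them on a 0-to-1 scale is to squash them with a logistic function. This is a sketch with made-up margins. The result keeps the same ordering but is not a calibrated probability.

```python
import math

def squash(margin):
    """Logistic function: maps any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-margin))

# Made-up margins standing in for clf.decision_function(X)
margins = [-1.2, 0.0, 0.4, 2.5]
scores = [squash(m) for m in margins]

# Positive margins land above 0.5, negative margins below it
for margin, score in zip(margins, scores):
    print(f"{margin:+.1f} -> {score:.2f}")
```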

In [None]:
unknown

In [None]:
unknown[unknown.predicted == 1].head()

In [None]:
# The top 1000 most likely creepy reviews

creepy_df = unknown.sort_values(by='predicted_proba',
                                ascending=False).head(1000)
creepy_df.head()

### Using a fine-tuned language model

The modern update to this might use [Hugging Face AutoTrain](https://huggingface.co/autotrain) to create a custom model. It will (potentially) be more effective than your old-fashioned machine learning model, with fewer parameters to tweak.

I trained a small model called [creepy-wapo](https://huggingface.co/wendys-llc/creepy-wapo).

In [None]:
from transformers import pipeline

creepy_pipeline = pipeline(model="wendys-llc/creepy-wapo")
data = [
    "I love the app, talking to people is fun",
    "Be careful talking to men, they all want nudes :("
]

creepy_pipeline(data)

### Using zero-shot classification with GPT

The *most* advanced method is to [just ask GPT](https://chat.openai.com/). This is called zero-shot classification because it doesn't need any examples!

While you could [just use ChatGPT like the Marshall Project did](https://generative-ai-newsroom.com/decoding-bureaucracy-5b0c1411171), using code is much faster and much more controllable.

In [None]:
import os
from langchain.chat_models import ChatOpenAI

# You'll need your own OpenAI GPT API key!
# https://platform.openai.com/apps
# Keep it out of the notebook: read it from an environment variable instead
API_KEY = os.environ["OPENAI_API_KEY"]

# Faster/cheaper
MODEL = 'gpt-3.5-turbo'

# Better results (I'm impatient, so we're using turbo!)
# MODEL = 'gpt-4'

llm = ChatOpenAI(openai_api_key=API_KEY, model_name=MODEL)

Here is an example of talking to GPT using Python code.

In [None]:
response = llm.predict("Give me a recipe for chocolate-chip cookies")
print(response)

Here is an example of zero-shot classification:

In [None]:
prompt = """
Categorize the following text as being about ENVIRONMENT, GUN CONTROL,
or IMMIGRATION. Respond with only the category.

Text: A Bill to Regulate the Sulfur Emissions of Coal-Fired Energy
Plants in the State of New York.
"""

response = llm.predict(prompt)
print(response)

Normally you would use this for a whole lot of different bills, so it would be best to design a template that you can fill in text for.

In [None]:
template = """
Categorize the following text as being about ENVIRONMENT, GUN CONTROL,
or IMMIGRATION. Respond with only the category.

Text: {bill_text}
"""

bills = [
    "A Bill to Allow Additional Refugees In Upstate New York",
    "A Bill to Close Down Coal-fired Power Plants",
    "A Bill to Ban Assault Rifles at Public Events"
]

for bill in bills:
    prompt = template.format(bill_text=bill)
    response = llm.predict(prompt)
    print(bill, "is", response)
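
The model answers in free text, so it can help to check each response against the allowed categories before trusting it. A minimal sketch: `normalize_category` is a hypothetical helper, and the responses below are made up rather than real GPT output.

```python
ALLOWED = ["ENVIRONMENT", "GUN CONTROL", "IMMIGRATION"]

def normalize_category(response, allowed=ALLOWED):
    """Return the single allowed category named in the response, or None."""
    cleaned = response.strip().upper()
    matches = [category for category in allowed if category in cleaned]
    # Zero matches or several matches both need a human to look
    return matches[0] if len(matches) == 1 else None

# Made-up responses standing in for llm.predict(prompt)
print(normalize_category("IMMIGRATION"))
print(normalize_category("Category: Gun Control."))
print(normalize_category("I'm not sure, maybe HEALTH CARE?"))
```

Anything that comes back as `None` can be set aside for manual review.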

## Summarization

Let's say we wanted to summarize [this story from the NYT](https://www.nytimes.com/2023/08/08/business/china-youth-unemployment.html) about youth unemployment in China. We have a few options!

In [None]:
text = open("nytimes-story.txt").read()
text[:2000]

### Using a Hugging Face model to summarize

Running a Hugging Face model yourself is free and private, and often fast enough. For example, we can use [this model originally created by Facebook](https://huggingface.co/facebook/bart-large-cnn), which is a [popular model for summarization](https://huggingface.co/models?pipeline_tag=summarization&sort=trending).

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

In [None]:
# We can't send the whole text! We're only sending the first 4,000 characters.

result = summarizer(text[:4000], max_length=300, min_length=30)

# The pipeline returns a list with one dictionary per input
print(result[0]['summary_text'])

### Using GPT to summarize

On the other hand, GPT results cost money and are less private, but the quality is often noticeably higher.

In [None]:
template = """
Write a concise summary of the following text.

TEXT: {story_text}
"""

prompt = template.format(story_text=text)
response = llm.predict(prompt)

print(response)

We can use **prompt engineering** to customize our results. You can learn more at [Prompt Engineering for Developers](https://www.deeplearning.ai/short-courses/chatgpt-prompt-engineering-for-developers/).

In [None]:
template = """
Write a concise summary of the following text in bullet-point format.
Address topics as action items, and assume the reader knows the basic
facts of the situation.

TEXT: {story_text}
"""

prompt = template.format(story_text=text)
response = llm.predict(prompt)

print(response)

### Summarizing longer texts

In [None]:
text = open("folktale.txt").read()
text[:3000]

In [None]:
template = """
Write a concise summary of the following text.

TEXT: {story_text}
"""

# The code below will give us an error: the text is too long for the model
# prompt = template.format(story_text=text)
# response = llm.predict(prompt)

# print(response)

Instead, we need to split it up into several pieces and summarize them one at a time.

In [None]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = TextLoader('folktale.txt', encoding='utf-8')
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2500)
docs = text_splitter.split_documents(documents)
len(docs)

In [None]:
from langchain import PromptTemplate
from langchain.chains.summarize import load_summarize_chain

prompt_template = """Write a concise summary of the following text.

TEXT: {text}


CONCISE SUMMARY IN ENGLISH:"""

PROMPT = PromptTemplate(template=prompt_template, input_variables=["text"])
chain = load_summarize_chain(llm,
                             chain_type="map_reduce",
                             return_intermediate_steps=True,
                             map_prompt=PROMPT,
                             combine_prompt=PROMPT)

result = chain({"input_documents": docs}, return_only_outputs=True)

In [None]:
print(result['output_text'])
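
The `map_reduce` idea can be pictured without any libraries: summarize each piece (map), then summarize the summaries (reduce). This is a rough sketch, not how LangChain implements it. `fake_summarize` is a stand-in for the GPT call, and the splitter cuts at fixed character counts instead of looking for paragraph breaks.

```python
def split_into_chunks(text, chunk_size):
    """Cut text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def fake_summarize(text):
    """Stand-in for an LLM call: keeps only the first sentence."""
    return text.strip().split(".")[0].strip() + "."

def map_reduce_summary(text, chunk_size):
    chunks = split_into_chunks(text, chunk_size)
    # Map step: summarize every chunk on its own
    partial_summaries = [fake_summarize(chunk) for chunk in chunks]
    # Reduce step: summarize the joined partial summaries
    return fake_summarize(" ".join(partial_summaries)), partial_summaries

# Made-up story text
story = "A girl tricks a devil. She runs home. " * 3
final, partials = map_reduce_summary(story, chunk_size=40)
print(len(partials), "partial summaries ->", final)
```

Notice that the fixed-size cuts land in the middle of words, which is one reason to prefer a splitter that respects sentence and paragraph boundaries.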

## Embeddings and semantic search

In [None]:
from sentence_transformers import SentenceTransformer
sentences = ["cat"]

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
print(embeddings[0][:25])

In [None]:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence"]

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
print(embeddings[0][:50])

In [None]:
import pandas as pd

sentences = [
    "Molly ate a fish",
    "Jen consumed a carp",
    "I would like to sell you a house",
    "Я пытаюсь купить дачу", # I'm trying to buy a summer home
    "J'aimerais vous louer un grand appartement", # I would like to rent a large apartment to you
    "This is a wonderful investment opportunity",
    "Это прекрасная возможность для инвестиций", # investment opportunity
    "C'est une merveilleuse opportunité d'investissement", # investment opportunity
    "これは素晴らしい投資機会です", # investment opportunity
    "野球はあなたが思うよりも面白いことがあります", # baseball can be more interesting than you think
    "Baseball can be more interesting than you'd think"
]

In [None]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute the cosine similarity between every pair of sentences
similarities = cosine_similarity(embeddings)

# Turn into a dataframe
pd.DataFrame(similarities,
            index=sentences,
            columns=sentences) \
            .style \
            .background_gradient(axis=None)
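
Cosine similarity compares the direction of two vectors and ignores their length. Here is a plain-Python sketch of the formula, using made-up three-number "embeddings" instead of real model output.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up three-dimensional "embeddings"
fish = [1.0, 2.0, 0.0]
carp = [2.0, 4.0, 0.0]   # same direction as fish, different length
house = [0.0, 0.0, 3.0]  # perpendicular to fish

print(cosine(fish, carp))
print(cosine(fish, house))
```

Vectors pointing the same way score close to 1, and unrelated (perpendicular) ones score 0.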

In [None]:
model = SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v2')
embeddings = model.encode(sentences)

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute similarities exactly the same as we did before!
similarities = cosine_similarity(embeddings)

# Turn into a dataframe
pd.DataFrame(similarities,
            index=sentences,
            columns=sentences) \
            .style \
            .background_gradient(axis=None)

### Searching across a database

In [None]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = TextLoader('book.txt')
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = text_splitter.split_documents(documents)
len(docs)

In [None]:
docs[10]

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings(model_name='paraphrase-multilingual-MiniLM-L12-v2')
docsearch = Chroma.from_documents(docs, embeddings)

In [None]:
scores = embeddings.embed_documents(["What did Zsuzska steal from the devil?"])[0]
len(scores)

In [None]:
print(scores[:20])

In [None]:
# k=4 because we want the four most similar chunks
docsearch.similarity_search("What did Zsuzska steal from the devil?", k=4)
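
Under the hood, a similarity search boils down to "embed the question, then rank the stored chunks by how close their vectors are." The sketch below does that by brute force with made-up chunks and made-up two-number vectors. A real vector store like Chroma uses an index to avoid comparing against every chunk.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def top_k(query_vector, indexed_chunks, k):
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(indexed_chunks,
                    key=lambda item: cosine(query_vector, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

# Made-up chunks paired with made-up vectors
indexed_chunks = [
    ("chunk about the devil", [0.9, 0.1]),
    ("chunk about a feast", [0.1, 0.9]),
    ("chunk about a theft", [0.8, 0.3]),
]
query_vector = [1.0, 0.0]

best = top_k(query_vector, indexed_chunks, k=2)
print(best)
```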

## Document-based question-and-answer

We can then use the related documents to answer questions. The example below sends the top few results to GPT along with our question. This is called **document-based question-and-answer with semantic search**. Be careful, though: it isn't perfect!
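
Conceptually, the "stuff" approach just pastes the retrieved passages and the question into one prompt. The sketch below shows the idea with a hypothetical helper and made-up passages. The wording is an assumption for illustration, not the prompt LangChain actually sends.

```python
def build_stuffed_prompt(question, chunks):
    """Paste every retrieved chunk into one prompt, followed by the question."""
    context = "\n\n".join(f"[{i}] {chunk}"
                          for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the numbered passages below. "
        "If the answer is not there, say you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )

# Made-up passages standing in for the results of a similarity search
retrieved = ["First retrieved passage.", "Second retrieved passage."]
prompt = build_stuffed_prompt("What did Zsuzska steal from the devil?", retrieved)
print(prompt)
```

If the search misses the right passage, the model never sees it, which is one way these answers go wrong.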

In [None]:
import os
from langchain.chat_models import ChatOpenAI

# You'll need your own OpenAI GPT API key!
# https://platform.openai.com/apps
# Keep it out of the notebook: read it from an environment variable instead
API_KEY = os.environ["OPENAI_API_KEY"]

# Faster/cheaper
MODEL = 'gpt-3.5-turbo'

# Better results (I'm impatient, so we're using turbo!)
# MODEL = 'gpt-4'

llm = ChatOpenAI(openai_api_key=API_KEY, model_name=MODEL, temperature=0)

In [None]:
from langchain.chains import RetrievalQAWithSourcesChain

qa = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=docsearch.as_retriever()
)

In [None]:
query = "What did Zsuzska steal from the devil?"

# This chain returns an answer *and* its sources, so we call it with a
# dictionary instead of using .run()
result = qa({"question": query}, return_only_outputs=True)
print(result["answer"])

In [None]:
query = "What did Zsuzska steal from the devil? Be sure to name everything!"

result = qa({"question": query}, return_only_outputs=True)
print(result["answer"])

## Transcription

We can use [Whisper](https://github.com/openai/whisper) to transcribe audio.

In [None]:
import whisper

In [None]:
%%time

model = whisper.load_model("tiny")

result = model.transcribe("6313.mp3")
result['text']

In [None]:
%%time

model = whisper.load_model("base")

result = model.transcribe("6313.mp3")
result['text']
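
Besides `text`, the result dictionary also has a `segments` list with start and end times, which is handy for finding a quote in the audio. A small sketch, using a made-up result in the same shape rather than a real transcription. `format_timestamp` is a hypothetical helper.

```python
def format_timestamp(seconds):
    """Turn a number of seconds into MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"

# Made-up result shaped like the dictionary returned by model.transcribe
sample_result = {
    "text": " Hello there. Thanks for calling.",
    "segments": [
        {"start": 0.0, "end": 1.8, "text": " Hello there."},
        {"start": 65.4, "end": 67.0, "text": " Thanks for calling."},
    ],
}

lines = [f"[{format_timestamp(seg['start'])}] {seg['text'].strip()}"
         for seg in sample_result["segments"]]
print("\n".join(lines))
```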

## What do you want to try to do?