## Installing prerequisites

In [111]:
!pip install -q --upgrade torch transformers \
    sentence-transformers sentencepiece \
    protobuf==3.20 pystemmer eli5 \
    openai-whisper scikit-learn \
    openai langchain==0.0.198


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3.1[0m[39;49m -> [0m[32;49m23.2.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython -m pip install --upgrade pip[0m


## Downloading our data

In [None]:
!wget https://raw.githubusercontent.com/jsoma/2023-journalismai/main/book.txt
!wget https://raw.githubusercontent.com/jsoma/2023-journalismai/main/folktale.txt
!wget https://raw.githubusercontent.com/jsoma/2023-journalismai/main/wapo-reviews-marked.csv
!wget https://raw.githubusercontent.com/jsoma/2023-journalismai/main/nytimes-story.txt
!wget https://raw.githubusercontent.com/jsoma/2023-journalismai/main/6313.mp3

## Sentiment analysis

Sentiment analysis judges whether a text is **positive** or **negative**.

In [116]:
from transformers import pipeline

sentiment_pipeline = pipeline("sentiment-analysis")
data = ["I love sandwiches"]
sentiment_pipeline(data)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9994644522666931}]

The warning suggests we should [specify a model](https://huggingface.co/models). Let's explicitly name the default one.

In [117]:
sentiment_pipeline = pipeline("sentiment-analysis",
                             model="distilbert-base-uncased-finetuned-sst-2-english")
data = ["I love sandwiches"]
sentiment_pipeline(data)

[{'label': 'POSITIVE', 'score': 0.9994644522666931}]

In [118]:
sentiment_pipeline = pipeline("sentiment-analysis",
                             model="distilbert-base-uncased-finetuned-sst-2-english")
data = ["j'adore les sandwichs"]
sentiment_pipeline(data)

[{'label': 'POSITIVE', 'score': 0.9986036419868469}]

In [120]:
sentiment_pipeline = pipeline("sentiment-analysis",
                             model="distilbert-base-uncased-finetuned-sst-2-english")
data = ["я люблю бутерброды"]
sentiment_pipeline(data)

[{'label': 'POSITIVE', 'score': 0.6742483377456665}]

If we want to try another one, we can look at [the most popular ones](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads&search=sentiment).

In [121]:
sentiment_pipeline = pipeline("sentiment-analysis",
                             model="cardiffnlp/twitter-xlm-roberta-base-sentiment")
data = ["я люблю бутерброды"]
sentiment_pipeline(data)

[{'label': 'positive', 'score': 0.7812552452087402}]
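
The English-only model was far less sure about the Russian sentence (about 0.67) than about the English one (about 0.999). Here is a minimal sketch of one way to act on that, assuming you want low-confidence results set aside for a human to check. The `flag_uncertain` helper and the 0.8 cutoff are illustrative choices, not part of `transformers`.

```python
def flag_uncertain(results, threshold=0.8):
    """Relabel low-confidence predictions as 'UNCERTAIN'.

    `results` has the same shape as the pipeline output above:
    a list of {'label': ..., 'score': ...} dictionaries.
    """
    flagged = []
    for result in results:
        label = result["label"] if result["score"] >= threshold else "UNCERTAIN"
        flagged.append({"label": label, "score": result["score"]})
    return flagged


# Scores copied from the cells above
results = [
    {"label": "POSITIVE", "score": 0.9994644522666931},  # "I love sandwiches"
    {"label": "POSITIVE", "score": 0.6742483377456665},  # Russian text, English-only model
]
print(flag_uncertain(results))
```

The right cutoff depends on your data, so it is worth checking a sample of flagged and unflagged results by hand.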

## Classification

**Classification** is a classic problem in investigative journalism.

You have a lot of documents: how do you find the ones you're interested in?

- Atlanta Journal-Constitution: [Doctors & Sex Abuse: Still forgiven](https://doctors.ajc.com/)
- Washington Post: [Apple says its App Store is ‘a safe and trusted place.’ We found 1,500 reports of unwanted sexual behavior on six apps, some targeting minors.](https://www.washingtonpost.com/technology/2019/11/22/apple-says-its-app-store-is-safe-trusted-place-we-found-reports-unwanted-sexual-behavior-six-apps-some-targeting-minors/)

### The old approach

Historically, you labeled a subset of the documents by hand, then trained a **machine learning algorithm** on those labels and used it to score the rest.

In [22]:
import pandas as pd
pd.set_option("display.max_colwidth", 300)

df = pd.read_csv("wapo-reviews-marked.csv")
df.head()

Unnamed: 0,Rating,Review,source,racism,bullying,sexual
0,5,It’s a great app to meet new people and chat in very satisfied with downloading this app i recommend this app if you like to chat or just to meet new people. And you can choose which country To find different users!,holla,,,
1,5,"Holla is an excellent app, where I get to know new people every time and even get to make new friends. I truly recommend this application to all people!",holla,,,
2,1,Get rid of micro transactions or i will find a new app to use. Why should i have to pay for that it’s so stupid,holla,0.0,0.0,0.0
3,5,"Free to use app, meet people around the world.",holla,,,
4,5,I got this app and everything has been different. I’ve met so many interesting people. From around the world. I was recently reunited with my high school girlfriend. We’re getting married. I met and married The love of my Life thanks to Holla. Thanks Holla!!!!!,holla,,,


In [23]:
known = df[df.sexual.notna()].copy()
unknown = df[df.sexual.isna()].copy()

In [24]:
known.head()

Unnamed: 0,Rating,Review,source,racism,bullying,sexual
2,1,Get rid of micro transactions or i will find a new app to use. Why should i have to pay for that it’s so stupid,holla,0.0,0.0,0.0
6,1,This is good but most of my messages never show up. This is very crapy and needs to be fixed,skout,0.0,0.0,0.0
8,1,I was really enjoying this app. This brought me out of the box. I’m an extremely shy person and this gave me somewhere to talk to nice people. I just got kicked of bc I’m 16 not “18” and I think that this change it kind of stupid bc yeah it’s for protection but like someone else said all you hav...,holla,0.0,0.0,0.0
13,1,It won’t lemme go live or anything like I think you fixed it for everyone but me and now it says I’m banned for no reason I didn’t even do anything,holla,0.0,0.0,0.0
15,1,No real ppl all fake or no reply,skout,0.0,0.0,0.0


In [25]:
%%time

from sklearn.feature_extraction.text import TfidfVectorizer
import Stemmer

stemmer = Stemmer.Stemmer('en')

class StemmedTfidfVectorizer(TfidfVectorizer):
    """TfidfVectorizer that stems each token before counting it."""
    def build_analyzer(self):
        # Wrap the standard analyzer so every token it yields is stemmed
        analyzer = super().build_analyzer()
        return lambda doc: stemmer.stemWords(analyzer(doc))

vectorizer = StemmedTfidfVectorizer(max_features=500, max_df=0.30)
matrix = vectorizer.fit_transform(known.Review)

words_df = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())
words_df.head(5)

CPU times: user 30.2 ms, sys: 7.67 ms, total: 37.9 ms
Wall time: 35.8 ms


Unnamed: 0,10,100,18,24,30,50,abl,about,accept,account,...,wouldn,write,wrong,yeah,year,yet,you,your,yubo,zero
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.127443,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.13998,0.0,0.0,0.172011,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.125648,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [26]:
from sklearn.svm import LinearSVC

X = matrix
y = known.sexual

clf = LinearSVC(class_weight='balanced')
clf.fit(X, y)

LinearSVC(class_weight='balanced')

In [27]:
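X = vectorizer.transform(unknown.Review)

unknown['predicted'] = clf.predict(X)
# Despite the column name, decision_function returns a signed distance from
# the decision boundary (positive means class 1), not a probability
unknown['predicted_proba'] = clf.decision_function(X)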

In [28]:
unknown

Unnamed: 0,Rating,Review,source,racism,bullying,sexual,predicted,predicted_proba
0,5,It’s a great app to meet new people and chat in very satisfied with downloading this app i recommend this app if you like to chat or just to meet new people. And you can choose which country To find different users!,holla,,,,0.0,-0.596082
1,5,"Holla is an excellent app, where I get to know new people every time and even get to make new friends. I truly recommend this application to all people!",holla,,,,0.0,-1.015293
3,5,"Free to use app, meet people around the world.",holla,,,,0.0,-1.202139
4,5,I got this app and everything has been different. I’ve met so many interesting people. From around the world. I was recently reunited with my high school girlfriend. We’re getting married. I met and married The love of my Life thanks to Holla. Thanks Holla!!!!!,holla,,,,0.0,-0.937766
5,5,You make good friends here,holla,,,,0.0,-0.975581
...,...,...,...,...,...,...,...,...
56044,5,"Fun app glad I got it for free, would be awesome if you could save logs either via picture or text. Keep up the good work.",chat-for-strangers,,,,0.0,-0.864825
56045,1,Can't even open the app. Fix it,skout,,,,0.0,-1.321490
56046,3,Great but drops convo all tge time :(,chat-for-strangers,,,,0.0,-0.528206
56047,5,Works just like the service it connects to.,chat-for-strangers,,,,0.0,-1.388547


In [29]:
unknown[unknown.predicted == 1].head()

Unnamed: 0,Rating,Review,source,racism,bullying,sexual,predicted,predicted_proba
151,5,"This app is very cool! You get to talk to many many different people from all over the world ! France,Germany,Italy,Brazil,Mexico and the list goes on and on. Make sure to always be nice and respectful and don’t show your private parts that’s not allowed you will get banned . That’s disgusting a...",holla,,,,1.0,0.22452
272,4,"I really like the app, but there is so many 13 year olds on that app and there’s girls showing their private parts on camera.",holla,,,,1.0,0.435016
285,5,"I like the concept and idea of the app, but the guys on here are really\nGross and mostly just catfishes (posing as another person) and it’s highly disappointing. Also I try to connect with guys and I’ll get girls who say they’re men? Plus on nearby mode I have most of my luck finding cute, nice...",holla,,,,1.0,0.223257
286,3,Literally all I match with is guys. NONSTOP. there needs to be a little bit more variety than just dudes back to back when it's set to girls and guys with no girls.,holla,,,,1.0,0.006054
340,3,Hi guys ok so my problem with this app is that I wanna 2p with my friends and with the new features idk how to do that anymore. Did you guys delete the 2p mode?if not how do I use the 2p mode again?,holla,,,,1.0,0.143846


In [30]:
# The top 1000 most likely creepy dude reviews

creepy_df = unknown.sort_values(by='predicted_proba',
                                ascending=False).head(1000)
creepy_df.head()

Unnamed: 0,Rating,Review,source,racism,bullying,sexual,predicted,predicted_proba
55605,4,Like a small thing that's pink/ blue to show whether your m or f. That would be nice. And to all the guys out there... Put your dick back in your pants. I'm a guy but I don't creep on girls. There a thing called porn.,chat-for-strangers,,,,1.0,1.187417
19423,2,All the guys on here ever ask for is nudes like I don't want to send my nudes to you,skout,,,,1.0,1.159414
11002,1,To many perverts and all they ask for is nudes🙄,chat-for-strangers,,,,1.0,1.081842
30386,1,Six 18yd men asked for nudes I’m 13,holla,,,,1.0,1.068405
16547,2,Just a bunch of horny guys wanting nudes,holla,,,,1.0,1.063398
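
The ranking above works because a linear SVM's `decision_function` score is a signed margin: anything above zero is predicted as class 1, and larger values mean the model is more confident. Here is a small sketch of that logic in plain Python, using scores copied from the tables above. The `rank_by_margin` helper is hypothetical and only for illustration.

```python
def rank_by_margin(scores):
    """Sort (review_id, score) pairs from most to least likely positive,
    attaching the label that a zero threshold would give."""
    ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)
    return [(review_id, score, 1 if score > 0 else 0) for review_id, score in ranked]


# (row index, decision_function score) pairs copied from the tables above
scores = [(0, -0.596082), (151, 0.22452), (286, 0.006054), (55605, 1.187417)]

for review_id, score, label in rank_by_margin(scores):
    print(review_id, score, label)
```

Reviews that score just above zero, like row 286, are the ones most worth a manual look before trusting the label.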


### Using a fine-tuned language model

The modern update to this might use [Hugging Face AutoTrain](https://huggingface.co/autotrain) to create a custom model. It will (potentially) be more effective than your old-fashioned machine learning model, with fewer parameters to tweak.

In [37]:
from transformers import pipeline

creepy_pipeline = pipeline(model="wendys-llc/creepy-wapo")
data = [
    "I love the app, talking to people is fun",
    "Be careful talking to men, they all want nudes :("
]

creepy_pipeline(data)

[{'label': '0.0', 'score': 0.998849630355835},
 {'label': '1.0', 'score': 0.8436581492424011}]

### Using zero-shot classification with GPT

The *most* advanced method is to [just ask GPT](https://chat.openai.com/). This is called zero-shot classification because it doesn't need any examples!

In [34]:
from langchain.chat_models import ChatOpenAI

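import os

# You'll need your own OpenAI GPT API key!
# https://platform.openai.com/apps
# Read it from an environment variable, or paste it in: API_KEY = "sk-XXXXX"
API_KEY = os.environ["OPENAI_API_KEY"]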

# Faster/cheaper
MODEL = 'gpt-3.5-turbo'

# Better results (I'm impatient, so we're using turbo!)
# MODEL = 'gpt-4'

llm = ChatOpenAI(openai_api_key=API_KEY, model_name=MODEL)

Here is an example of talking to GPT using Python code.

In [35]:
response = llm.predict("Give me a recipe for chocolate-chip cookies")
print(response)

Sure! Here is a classic recipe for chocolate chip cookies:

Ingredients:
- 1 cup (2 sticks) unsalted butter, softened
- 1 cup granulated sugar
- 1 cup packed brown sugar
- 2 large eggs
- 1 teaspoon pure vanilla extract
- 3 cups all-purpose flour
- 1 teaspoon baking soda
- 1/2 teaspoon salt
- 2 cups chocolate chips (semi-sweet or milk chocolate)

Instructions:
1. Preheat your oven to 350°F (175°C). Line baking sheets with parchment paper or silicone baking mats.

2. In a large mixing bowl, cream together the softened butter, granulated sugar, and brown sugar until light and fluffy. This can be done by hand or with an electric mixer.

3. Add the eggs one at a time, beating well after each addition. Stir in the vanilla extract.

4. In a separate bowl, whisk together the flour, baking soda, and salt. Gradually add the dry ingredients to the wet ingredients, mixing until just combined.

5. Fold in the chocolate chips until evenly distributed throughout the dough.

6. Using a cookie scoop or

Here is an example of zero-shot classification.

In [36]:
prompt = """
Categorize the following text as being about ENVIRONMENT, GUN CONTROL,
or IMMIGRATION. Respond with only the category.

Text: A Bill to Regulate the Sulfur Emissions of Coal-Fired Energy
Plants in the State of New York.
"""

response = llm.predict(prompt)
print(response)

ENVIRONMENT


Normally you would run this over a whole lot of different bills, so it's best to write a template with a placeholder that you fill in with each bill's text.

In [40]:
template = """
Categorize the following text as being about ENVIRONMENT, GUN CONTROL,
or IMMIGRATION. Respond with only the category.

Text: {bill_text}
"""

bills = [
    "A Bill to Allow Additional Refugees In Upstate New York",
    "A Bill to Close Down Coal-fired Power Plants",
    "A Bill to Banning Assault Rifles at Public Events"
]

for bill in bills:
    prompt = template.format(bill_text=bill)
    response = llm.predict(prompt)
    print(bill, "is", response)

A Bill to Allow Additional Refugees In Upstate New York is IMMIGRATION
A Bill to Close Down Coal-fired Power Plants is ENVIRONMENT
A Bill to Banning Assault Rifles at Public Events is GUN CONTROL
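
A language model does not always follow "respond with only the category", so it can help to check each reply before storing it. Here is a minimal sketch, assuming replies may come back with stray whitespace, punctuation or different capitalization. The `parse_category` helper is hypothetical, not part of LangChain.

```python
CATEGORIES = ["ENVIRONMENT", "GUN CONTROL", "IMMIGRATION"]


def parse_category(response, categories=CATEGORIES):
    """Return the matching category, or None if the reply is not one of them."""
    cleaned = response.strip().strip(".\"'").upper()
    return cleaned if cleaned in categories else None


print(parse_category("ENVIRONMENT"))
print(parse_category(" gun control.\n"))
print(parse_category("This bill is about taxes."))
```

Replies that come back as `None` can be collected and reviewed by hand instead of silently ending up in your data.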


## Summarization

Let's say we wanted to summarize [this story from the NYT](https://www.nytimes.com/2023/08/08/business/china-youth-unemployment.html) about youth unemployment in China. We have a few options!

In [54]:
text = open("nytimes-story.txt").read()
text[:200]

'At this year’s commencement ceremony for the Chongqing Metropolitan College of Science and Technology in southwestern China, the members of the graduating class did not receive the usual lofty message'

### Using a Hugging Face model to summarize

Using a Hugging Face model is free, fast and private. For example, we can use [this model originally created by Facebook](https://huggingface.co/facebook/bart-large-cnn), which is a [popular model for summarization](https://huggingface.co/models?pipeline_tag=summarization&sort=trending).

In [55]:
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

In [68]:
# The model can only read a limited amount of text, so we send just the first 4,000 characters.

result = summarizer(text[:4000], max_length=300, min_length=30)
print(result)

[{'summary_text': 'China’s unemployment rate for 16- to 24-year-olds in urban areas hit a record 21.3 percent in June. The number of students enrolling in colleges and universities increased to 10.1 million in 2022 from 754,000 in 1992.'}]
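
Slicing with `text[:4000]` can cut the story off mid-sentence. Here is a small sketch of a gentler cut that ends on a full sentence when it can. The `truncate_at_sentence` helper is hypothetical, and it counts characters, while the model's real limit is measured in tokens, so treat the number as a rough budget.

```python
def truncate_at_sentence(text, max_chars=4000):
    """Cut `text` to at most `max_chars`, ending on a full sentence when possible."""
    if len(text) <= max_chars:
        return text
    clipped = text[:max_chars]
    last_stop = max(clipped.rfind(". "), clipped.rfind("! "), clipped.rfind("? "))
    # Fall back to a hard cut if no sentence ending was found
    return clipped if last_stop == -1 else clipped[: last_stop + 1]


sample = "First sentence. Second sentence is longer. Third one gets cut off here."
print(truncate_at_sentence(sample, max_chars=50))
```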


### Using GPT to summarize

On the other hand, GPT costs money and is less private, but the quality of its summaries is often noticeably higher.

In [51]:
template = """
Write a concise summary of the following text.

TEXT: {story_text}
"""

prompt = template.format(story_text=text)
response = llm.predict(prompt)

print(response)

China's record number of college graduates entering the job market is exacerbating the country's youth unemployment crisis, with the unemployment rate for 16- to 24-year-olds in urban areas reaching a record high of 21.3% in June. Government policymakers are pressuring colleges to do more to find jobs for graduates, with school administrators' job performance now tied to the employment rate of their students. The mismatch between the jobs graduates want and what is available, combined with economic downturns and government crackdowns on industries, has contributed to the issue. The problem of youth unemployment could have significant social and political consequences if not addressed properly.


We can use **prompt engineering** to customize our results. You can learn more at [Prompt Engineering for Developers](https://www.deeplearning.ai/short-courses/chatgpt-prompt-engineering-for-developers/).

In [52]:
template = """
Write a concise summary of the following text in bullet-point format.
Address topics as action items, and assume the reader knows the basic
facts of the situation.

TEXT: {story_text}
"""

prompt = template.format(story_text=text)
response = llm.predict(prompt)

print(response)

- Chinese college graduates face a bleak employment outlook due to a record number of graduates and a stagnant economy.
- China's youth unemployment rate hit a record 21.3% in June and is expected to rise further.
- Government policymakers are pressuring colleges to do more to find jobs for graduates.
- The problem of youth unemployment could have significant social and political consequences.
- Economic volatility, government crackdowns, and a mismatch between job expectations and availability contribute to the issue.
- Beijing has released policy initiatives and support measures to encourage private companies to add jobs.
- College administrators face pressure to meet employment mandates from the government.
- Extreme measures, such as fabricating job offers, are being taken by students and administrators to appease authorities.
- The pressure campaign on colleges is intensifying, leading to greater desperation among students and administrators.


### Summarizing longer texts

In [70]:
text = open("folktale.txt").read()
text[:300]

'RÁADÓ ÉS ANYICSKA.\nEgyszer, hol volt – hol nem volt, volt a világon egy király, a ki teljeséletében mindig a háborúban, harczban lakott. Már kilencz esztendő óta aháza tájékán se’ volt, azt se’ tudta hogy mije van otthon. Elhatároztahát hogy már csak akárhogy – mint teszi szerét, de haza megy, felir'

In [72]:
template = """
Write a concise summary of the following text.

TEXT: {story_text}
"""

# The lines below raise an error: the full text is too long to fit in a single request to the model
# prompt = template.format(story_text=text)
# response = llm.predict(prompt)

# print(response)

Instead, we need to split it up into several pieces and summarize them one at a time.

In [73]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = TextLoader('folktale.txt', encoding='utf-8')
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2500)
docs = text_splitter.split_documents(documents)
len(docs)

20
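
To make the splitting step less of a black box, here is a rough sketch of the idea in plain Python, assuming a simple greedy packing of lines into chunks. The `split_into_chunks` helper is hypothetical and much simpler than LangChain's `RecursiveCharacterTextSplitter`, which tries several separators in turn.

```python
def split_into_chunks(text, chunk_size=2500, separator="\n"):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole rather than split further.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Toy text: ten short lines, packed into chunks of at most 20 characters
toy_text = "\n".join(f"Line {i}" for i in range(10))
toy_chunks = split_into_chunks(toy_text, chunk_size=20)
print(len(toy_chunks))
print(toy_chunks[0])
```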

In [74]:
from langchain import PromptTemplate
from langchain.chains.summarize import load_summarize_chain

prompt_template = """Write a concise summary of the following text.

TEXT: {text}


CONCISE SUMMARY IN ENGLISH:"""

PROMPT = PromptTemplate(template=prompt_template, input_variables=["text"])
chain = load_summarize_chain(llm,
                             chain_type="map_reduce",
                             return_intermediate_steps=True,
                             map_prompt=PROMPT,
                             combine_prompt=PROMPT)

result = chain({"input_documents": docs}, return_only_outputs=True)

In [75]:
print(result['output_text'])

The text tells the story of Ráadó, who follows the instructions of a beautiful fairy to meet her at a lake. Ráadó sees three swans transform into fairy princesses, and falls in love with the youngest one. He steals her swan dress and asks her to marry him. She agrees, and they fly back to his father's castle. However, she warns him to hide their love so they won't be separated.
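
The `map_reduce` chain type summarizes each piece on its own, then summarizes the summaries. Here is a minimal sketch of that flow, assuming a stand-in `first_sentence` function in place of a real call to the language model so that it runs offline. Both helper names and the toy text are made up for illustration.

```python
def map_reduce_summarize(chunks, summarize):
    """Summarize each chunk (map), then summarize the joined summaries (reduce)."""
    partial_summaries = [summarize(chunk) for chunk in chunks]
    combined = " ".join(partial_summaries)
    return summarize(combined), partial_summaries


# Stand-in "summarizer" so the sketch runs offline: it keeps only the first sentence.
def first_sentence(text):
    return text.split(". ")[0].rstrip(".") + "."


toy_chunks = [
    "A king went to war. He was away nine years.",
    "A prince met a fox. The fox could talk.",
]
final, partials = map_reduce_summarize(toy_chunks, first_sentence)
print(partials)
print(final)
```

In real use, `summarize` would be a function that fills in the prompt template and sends it to the model, which is also why detail from later chunks can get lost in the final combine step.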


## Embeddings and semantic search

In [77]:
from sentence_transformers import SentenceTransformer
sentences = ["cat"]

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
print(embeddings[0][:25])

[ 0.03733039  0.0511619  -0.00030607  0.06020993 -0.11749442 -0.0142301
  0.10577624  0.02678623  0.02633771 -0.02570082 -0.02349038 -0.05955521
 -0.03021392  0.01632017 -0.02907013 -0.02168973 -0.06624991  0.00185665
 -0.02400624 -0.02846252 -0.04663163  0.04970483  0.00308297  0.00176273
 -0.0677575 ]


In [78]:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence"]

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
print(embeddings[0][:50])

[ 0.0676569   0.06349581  0.0487131   0.07930496  0.03744796  0.00265277
  0.03937485 -0.00709837  0.0593615   0.03153696  0.06009803 -0.05290522
  0.04060676 -0.02593078  0.02984274  0.00112689  0.07351495 -0.05038185
 -0.12238666  0.02370274  0.02972649  0.04247681  0.0256338   0.00199517
 -0.05691912 -0.02715985 -0.03290359  0.06602488  0.11900704 -0.04587924
 -0.07262138 -0.03258408  0.05234135  0.04505523  0.00825302  0.03670237
 -0.01394151  0.06539196 -0.02642729  0.00020634 -0.01366437 -0.03628108
 -0.0195043  -0.02897387  0.03942709 -0.08840913  0.00262434  0.01367143
  0.04830637 -0.03115652]
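
Each list of numbers is a point in space, and texts with similar meanings should point in similar directions. A common way to measure that is cosine similarity. Here is a minimal sketch in plain Python using toy three-number vectors, where real embeddings have hundreds of dimensions. The `cosine` helper is written out for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # same direction
print(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # unrelated directions
print(cosine([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # same direction, different length
```

A value near 1 means the two vectors point the same way, and a value near 0 means they are unrelated.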


In [79]:
import pandas as pd

sentences = [
    "Molly ate a fish",
    "Jen consumed a carp",
    "I would like to sell you a house",
    "Я пытаюсь купить дачу", # I'm trying to buy a summer home
    "J'aimerais vous louer un grand appartement", # I would like to rent a large apartment to you
    "This is a wonderful investment opportunity",
    "Это прекрасная возможность для инвестиций", # investment opportunity
    "C'est une merveilleuse opportunité d'investissement", # investment opportunity
    "これは素晴らしい投資機会です", # investment opportunity
    "野球はあなたが思うよりも面白いことがあります", # baseball can be more interesting than you think
    "Baseball can be more interesting than you'd think"
]

In [80]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)

In [81]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute the cosine similarity between every pair of sentence embeddings
similarities = cosine_similarity(embeddings)

# Turn into a dataframe
pd.DataFrame(similarities,
            index=sentences,
            columns=sentences) \
            .style \
            .background_gradient(axis=None)

Unnamed: 0,Molly ate a fish,Jen consumed a carp,I would like to sell you a house,Я пытаюсь купить дачу,J'aimerais vous louer un grand appartement,This is a wonderful investment opportunity,Это прекрасная возможность для инвестиций,C'est une merveilleuse opportunité d'investissement,これは素晴らしい投資機会です,野球はあなたが思うよりも面白いことがあります,Baseball can be more interesting than you'd think
Molly ate a fish,1.0,0.526053,0.025476,0.098335,0.020435,-0.065293,0.035801,-0.062506,0.027358,0.017622,0.013324
Jen consumed a carp,0.526053,1.0,0.044178,0.035044,-0.018194,-0.004438,-0.078566,-0.011418,0.090357,0.131507,0.010598
I would like to sell you a house,0.025476,0.044178,1.0,0.154773,0.083555,0.386736,0.017175,-0.006744,0.010857,0.02551,-0.001803
Я пытаюсь купить дачу,0.098335,0.035044,0.154773,1.0,0.159519,0.064379,0.462397,0.09211,0.314708,0.327675,-0.119706
J'aimerais vous louer un grand appartement,0.020435,-0.018194,0.083555,0.159519,1.0,0.032253,0.365505,0.566635,0.172406,0.110118,0.007807
This is a wonderful investment opportunity,-0.065293,-0.004438,0.386736,0.064379,0.032253,1.0,-0.030322,0.21223,0.023889,-0.002844,0.114933
Это прекрасная возможность для инвестиций,0.035801,-0.078566,0.017175,0.462397,0.365505,-0.030322,1.0,0.282414,0.267571,0.285873,-0.039264
C'est une merveilleuse opportunité d'investissement,-0.062506,-0.011418,-0.006744,0.09211,0.566635,0.21223,0.282414,1.0,0.292651,0.187989,0.021442
これは素晴らしい投資機会です,0.027358,0.090357,0.010857,0.314708,0.172406,0.023889,0.267571,0.292651,1.0,0.577265,-0.101104
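野球はあなたが思うよりも面白いことがあります,0.017622,0.131507,0.02551,0.327675,0.110118,-0.002844,0.285873,0.187989,0.577265,1.0,-0.107835
Baseball can be more interesting than you'd think,0.013324,0.010598,-0.001803,-0.119706,0.007807,0.114933,-0.039264,0.021442,-0.101104,-0.107835,1.0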


In [82]:
model = SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v2')
embeddings = model.encode(sentences)

In [83]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute similarities exactly the same as we did before!
similarities = cosine_similarity(embeddings)

# Turn into a dataframe
pd.DataFrame(similarities,
            index=sentences,
            columns=sentences) \
            .style \
            .background_gradient(axis=None)

Unnamed: 0,Molly ate a fish,Jen consumed a carp,I would like to sell you a house,Я пытаюсь купить дачу,J'aimerais vous louer un grand appartement,This is a wonderful investment opportunity,Это прекрасная возможность для инвестиций,C'est une merveilleuse opportunité d'investissement,これは素晴らしい投資機会です,野球はあなたが思うよりも面白いことがあります,Baseball can be more interesting than you'd think
Molly ate a fish,1.0,0.358347,0.05834,0.145439,-0.024103,-0.070145,-0.075333,-0.073496,-0.111467,-0.025614,-0.012353
Jen consumed a carp,0.358347,1.0,0.059195,0.190241,-0.001941,-0.024359,-0.024816,-0.023295,-0.087019,0.040799,0.054475
I would like to sell you a house,0.05834,0.059195,1.0,0.418692,0.642746,0.081795,0.118611,0.067805,0.04256,0.144491,0.117145
Я пытаюсь купить дачу,0.145439,0.190241,0.418692,1.0,0.351605,0.120679,0.184644,0.144633,0.115598,0.050505,0.02481
J'aimerais vous louer un grand appartement,-0.024103,-0.001941,0.642746,0.351605,1.0,0.203307,0.238716,0.204762,0.195163,0.201317,0.133963
This is a wonderful investment opportunity,-0.070145,-0.024359,0.081795,0.120679,0.203307,1.0,0.953561,0.964282,0.945246,0.062618,0.097885
Это прекрасная возможность для инвестиций,-0.075333,-0.024816,0.118611,0.184644,0.238716,0.953561,1.0,0.968368,0.944719,0.084221,0.103252
C'est une merveilleuse opportunité d'investissement,-0.073496,-0.023295,0.067805,0.144633,0.204762,0.964282,0.968368,1.0,0.959357,0.086458,0.113237
これは素晴らしい投資機会です,-0.111467,-0.087019,0.04256,0.115598,0.195163,0.945246,0.944719,0.959357,1.0,0.091451,0.078595
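野球はあなたが思うよりも面白いことがあります,-0.025614,0.040799,0.144491,0.050505,0.201317,0.062618,0.084221,0.086458,0.091451,1.0,0.866255
Baseball can be more interesting than you'd think,-0.012353,0.054475,0.117145,0.02481,0.133963,0.097885,0.103252,0.113237,0.078595,0.866255,1.0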


### Searching across a database

In [84]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = TextLoader('book.txt', encoding='utf-8')
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = text_splitter.split_documents(documents)
len(docs)

456

In [86]:
docs[10]

Document(lc_kwargs={'page_content': 'Nagyon elszomorodott e beszéden a királyfi, elébb nem is akarta hinni,de az inas úgyséval is erősítette, gondolta hát magában, hogy most nemtölti az időt, majd visszafelé jövet haza híjja őket.\nMentek aztán tovább, beértek egy nagy erdőbe, ott az út egy helyenkétfelé vált, egyik se tudta a járást, elkezdtek tanácskozni, hogy merremenjenek. A mint ott tanácskoznak, egyszer, – mintha csak a föld alólbútt volna ki, vagy az égből cseppent volna, – ott termett egy szép nagyróka. A királyfi a mint meglátta, nyult a nyila után hogy majd meglövi,hát, uramfia, – tán nem is hinnék kendtek, ha nem mondanám, – megszólalta róka emberi nyelven:\n– No felséges királyfi hát eltévedtek, vagy min tanácskoznak?\n– Hiszen nem tévedtünk épen el – felelt neki a királyfi, – hanem aztcsakugyan nem tudjuk, hogy e közül a két út közül melyik visz aVerestengerhez. Hát miért kérded?\n– Csak azért, mert én útba tudom igazitani a királyfit, tudom a járástezen a tájékon. De hát 

In [88]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings(model_name='paraphrase-multilingual-MiniLM-L12-v2')
docsearch = Chroma.from_documents(docs, embeddings)

Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.


In [89]:
scores = embeddings.embed_documents(["What did Zsuzska steal from the devil?"])[0]
len(scores)

384

In [90]:
print(scores[:20])

[-0.46812501549720764, 0.47161218523979187, -0.39475440979003906, 0.18969321250915527, 0.08756688982248306, 0.04914027079939842, 0.6678051948547363, 0.24234464764595032, 0.011556974612176418, 0.24045951664447784, 0.15715603530406952, 0.04403669759631157, 0.25661030411720276, -0.12375714629888535, -0.5067397356033325, 0.053942807018756866, 0.06712772697210312, 0.13114140927791595, -0.17556653916835785, 0.2375480830669403]


In [92]:
# k=4 returns the four chunks most similar to the question
docsearch.similarity_search("What did Zsuzska steal from the devil?", k=4)

[Document(lc_kwargs={'page_content': 'Hiába tagadta szegény Zsuzska, nem használt semmit, elindult hát nagyszomorúan. Épen éjfél volt, mikor az ördög házához ért, aludt az ördögis, a felesége is. Zsuzska csendesen belopódzott, ellopta a tenger-ütőpálczát, avval bekiáltott az ablakon.\n– Hej ördög, viszem ám már a tenger-ütő pálczádat is.\n– Hej kutya Zsuzska, megöletted három szép lyányomat, elloptad atenger-lépő czipőmet, most viszed a tenger-ütő pálczámat, de majdmeglakolsz te ezért.\nUtána is szaladt, de megint csak a tengerparton tudott közel jutnihozzá, ott meg Zsuzska megütötte a tengert a tenger-ütő pálczával,kétfelé vált előtte, utána meg összecsapódott, megint nem foghatta megaz ördög. Zsuzska ment egyenesen a királyhoz.\n– No felséges király, elhoztam már a tengerütő pálczát is.', 'metadata': {'source': 'book.txt'}}, page_content='Hiába tagadta szegény Zsuzska, nem használt semmit, elindult hát nagyszomorúan. Épen éjfél volt, mikor az ördög házához ért, aludt az ördögis, a fe
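
Conceptually, a similarity search embeds the question, compares it with every stored chunk, and returns the closest matches. Here is a brute-force sketch of that idea with toy two-number vectors. The `cosine` and `top_k` helpers are hypothetical, and this is not how Chroma works internally.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vector, documents, k=4):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine(query_vector, vector), text) for text, vector in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Toy "database" of (text, embedding) pairs; real embeddings come from the model
documents = [
    ("chunk about a fox", [0.9, 0.1]),
    ("chunk about a devil", [0.1, 0.9]),
    ("chunk about a king", [0.7, 0.7]),
]
query_vector = [0.0, 1.0]

for score, text in top_k(query_vector, documents, k=2):
    print(round(score, 3), text)
```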

## Document-based question-and-answer

We can then use the related documents to answer questions. The example below sends the top few results to GPT along with our question. This is called **document-based question-and-answer with semantic search**. Be careful, though: it isn't perfect!

In [107]:
from langchain.chat_models import ChatOpenAI

# You'll need your own OpenAI GPT API key!
# https://platform.openai.com/apps
# API_KEY = "sk-XXXXX"

# Faster/cheaper
MODEL = 'gpt-3.5-turbo'

# Better results (I'm impatient, so we're using turbo!)
# MODEL = 'gpt-4'

llm = ChatOpenAI(openai_api_key=API_KEY, model_name=MODEL, temperature=0)

In [108]:
from langchain.chains import RetrievalQA

# RetrievalQA returns a single answer string, which is what qa.run() below expects
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=docsearch.as_retriever()
)

In [109]:
query = "What did Zsuzska steal from the devil?"

result = qa.run(query)
print(result)

Zsuzska stole the tenger-ütő pálczát (a special rod) from the devil.


In [110]:
query = "What did Zsuzska steal from the devil? Be sure to name everything!"

result = qa.run(query)
print(result)

Zsuzska stole the tenger-ütő pálczát (a staff that can split the sea), the arany fej káposztát (a golden cabbage head), and the tenger-lépő czipőt (a pair of shoes that allow walking on water).
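
The `"stuff"` chain type puts all of the retrieved chunks into one prompt together with the question. Here is a rough sketch of what such a prompt could look like. The `build_stuff_prompt` helper and its wording are invented for illustration and are not LangChain's actual prompt.

```python
def build_stuff_prompt(question, chunks):
    """Put every retrieved chunk into a single prompt, followed by the question."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"QUESTION: {question}\n"
    )


# Placeholder chunks standing in for the search results
toy_chunks = ["First retrieved chunk.", "Second retrieved chunk."]
toy_prompt = build_stuff_prompt("What did Zsuzska steal from the devil?", toy_chunks)
print(toy_prompt)
```

Because the model only sees the retrieved chunks, an answer can be incomplete when the relevant passage was not among the top results, which is why rephrasing the question changed the answer above.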


## Transcription

We can use [Whisper](https://github.com/openai/whisper) to transcribe audio.

In [115]:
import whisper

In [113]:
%%time

model = whisper.load_model("tiny")

result = model.transcribe("6313.mp3")
result['text']



" Okay, I was wondering if you could tell me a little bit about the program, what exactly goes on and... Okay, well basically it's an intensive program designed to last four weeks, in session four weeks long, although we do have two sessions in 1996 that will be two weeks long. But our general four week session involves seven hours of contact each day of Spanish. It involves starting at 9 o'clock in the morning with three hours of grammar instruction, and then it's followed immediately by an hour of group conversation with the same instructors, same class, but we have extensive gardens here at Institute and so students will usually know how to a garden setting because it's a bit more comfortable. And then after a break for lunch, students return and we have a complete flip of instruction and that instead of textbook and clinical type instruction, we have hands-on and we have we offer workshops and me-hey-backstrap weaving, regional cooking of Wahaka at Zompusnell pottery, and if we hav

In [114]:
%%time

model = whisper.load_model("base")

result = model.transcribe("6313.mp3")
result['text']



CPU times: user 1min 4s, sys: 9.13 s, total: 1min 14s
Wall time: 47.2 s


" Okay, I was wondering if you could tell me a little bit about the program. What exactly goes on? Okay, well basically it's an intensive program designed to last four weeks, each session is four weeks long, although we do have two sessions in 1996 that will be two weeks long. But our general four weeks session involves seven hours of contact each day of Spanish. It involves starting at nine o'clock in the morning with three hours of grammar instruction. And then it's followed immediately by an hour of group conversation with the same instructor, same class, but we have extensive gardens here at the Institute and so students will usually go out to a garden setting because it's a little more comfortable. And then after a break for lunch, students return and we have a complete flip of instruction in that instead of textbook and clinical type instruction, we have hands on and we offer workshops in Mejibak's drop leaving, regional cooking of Wahaka at Donpas-Nal pottery. And if we have lar

## What do you want to try to do?