# Pipeline 1, approach 1

This notebook documents our attempts to assign categories to articles using pre-trained models.

<img src="../reports/illustrations/pipeline1.png" width=800 />

<img src="../reports/illustrations/pipeline2_approach1.jpeg" width=800 />

## GPT approach using OpenAI's interface

In [1]:
!pip3 install openai

Defaulting to user installation because normal site-packages is not writeable
Collecting openai
  Downloading openai-0.27.2-py3-none-any.whl (70 kB)
Installing collected packages: openai
Successfully installed openai-0.27.2


In [10]:
import json
import pandas as pd

import openai

Load the file and slice out a smaller chunk (articles last modified between 2019-01-01 and 2022-01-01)

In [11]:
# Open the JSON file and parse it; the context manager closes the file afterwards
with open('../data/raw/CMS_2010_to_June_2022_ENGLISH.json') as f:
    data = json.load(f)

df = pd.DataFrame.from_dict(data)

print(df.head())

del data  # the raw JSON is no longer needed once the DataFrame exists

df['lastModifiedDate'] = pd.to_datetime(df['lastModifiedDate'])

# Index by date so the frame can be sliced by a date range
df = df.set_index(['lastModifiedDate'])
start_date = '2019-01-01'
end_date = '2022-01-01'
df_small = df.loc[start_date:end_date]

df_small.reset_index(inplace=True)
del df

         id                                               name  \
0  16489913                 UN imposes sanctions on DRC rebels   
1  16489912            Catholic Church abuse hotline goes cold   
2  16489903  "Fiscal cliff" tax hikes go into effect in the US   
3  16490025  Kim seeks reconciliation with South, economic ...   
4  16490029  Senate-approved fiscal deal faces House consid...   

                    shortTitle  \
0      UN sanctions DRC rebels   
1                 Disconnected   
2  US goes over "fiscal cliff"   
3    A rare New Year's address   
4   US House mulls fiscal deal   

                                                text  \
0  <p>\n\tThe UN Security Council has sanctioned ...   
1  <p>\n\tFor two and a half years, the counselin...   
2  <p>\n\tAs the clock struck midnight in Washing...   
3  <p>\n\tKim Jong Un, who came to office just ov...   
4  <p>\n\tLess than two hours after the US had of...   

                                              teaser  \
0  A 



In [15]:
df_small = df_small[df_small['thematicFocusCategory'].notna()] #Drop articles with no category

Replace mixed categories with first-level categories

NOTE: Categories '[Miscellaneous]' and 'Video News' are not considered

In [31]:
categories = pd.read_csv('../data/processed/categories.csv')
categories.head()

Unnamed: 0,first_level,second_level,is_primary
0,Cars and Transportation,Cars and Transportation,yes
1,Education,Education,yes
2,Learning German,Learning German,yes
3,Digital World,Digital World,yes
4,History,History,yes


In [32]:
to_replace = categories.loc[categories['is_primary'] == 'no']   #We're only replacing secondary categories
to_replace

Unnamed: 0,first_level,second_level,is_primary
10,Culture,Architecture,no
11,Culture,Design,no
12,Culture,Film,no
13,Culture,Arts,no
14,Culture,Literature,no
15,Culture,Music,no
16,Culture,Dance,no
17,Culture,Theater,no
25,Politics,Conflicts,no
26,Politics,Terrorism,no


In [33]:
to_replace = dict(zip(to_replace['second_level'], to_replace['first_level']))
to_replace

{'Architecture': 'Culture',
 'Design': 'Culture',
 'Film': 'Culture',
 'Arts': 'Culture',
 'Literature': 'Culture',
 'Music': 'Culture',
 'Dance': 'Culture',
 'Theater': 'Culture',
 'Conflicts': 'Politics',
 'Terrorism': 'Politics',
 'Corruption': 'Law and Justice',
 'Crime': 'Law and Justice',
 'Rule of Law': 'Law and Justice',
 'Press Freedom': 'Law and Justice',
 'Diversity': 'Human Rights',
 'Freedom of Speech': 'Human Rights',
 'Equality': 'Human Rights',
 'Soccer': 'Sports',
 'Trade': 'Business',
 'Globalization': 'Business',
 'Food Security': 'Business'}
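The same lookup can be sketched without pandas. This is a minimal illustration, assuming a few hand-written rows that mirror the columns of `categories.csv`; the `rows` sample and the `to_first_level` helper are hypothetical and not part of the notebook.

```python
# Illustrative rows mirroring the columns of categories.csv (not the full file).
rows = [
    {"first_level": "Culture", "second_level": "Culture", "is_primary": "yes"},
    {"first_level": "Culture", "second_level": "Film", "is_primary": "no"},
    {"first_level": "Politics", "second_level": "Terrorism", "is_primary": "no"},
]

# Keep only secondary categories, keyed by their second-level name.
mapping = {
    row["second_level"]: row["first_level"]
    for row in rows
    if row["is_primary"] == "no"
}


def to_first_level(category, mapping):
    """Return the first-level category; names without a mapping stay unchanged."""
    return mapping.get(category, category)
```

Primary categories are deliberately left out of the mapping, so they pass through untouched, which matches how the replacement is used further down.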

Extract the category names as strings, then replace secondary categories with their first-level category

In [34]:
df_small['categoryStrings'] = df_small['thematicFocusCategory'].apply(func = lambda dic: dic['name'])
df_small[['thematicFocusCategory', 'categoryStrings']].head()

Unnamed: 0,thematicFocusCategory,categoryStrings
0,{'name': 'Science'},Science
1,{'name': 'Law and Justice'},Law and Justice
2,{'name': 'Politics'},Politics
3,{'name': 'Crime'},Crime
4,{'name': 'Politics'},Politics
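The lambda above assumes that every entry is a dictionary with a `name` key. That holds here because rows without a category were dropped earlier, but a more defensive variant could look like the sketch below. The `category_name` helper is hypothetical and only illustrates the idea.

```python
def category_name(value):
    """Return value['name'] for a dict, or None for missing or malformed entries."""
    if isinstance(value, dict):
        return value.get("name")
    return None
```

Such a helper could be passed to `apply` in place of the lambda; rows that come back as `None` can then be inspected or dropped explicitly.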


In [35]:
df_small['cleanCategoryStrings'] = df_small['categoryStrings'].replace(to_replace)
df_small[['categoryStrings', 'cleanCategoryStrings']]

Unnamed: 0,categoryStrings,cleanCategoryStrings
0,Science,Science
1,Law and Justice,Law and Justice
2,Politics,Politics
3,Crime,Law and Justice
4,Politics,Politics
...,...,...
33837,Politics,Politics
33838,Crime,Law and Justice
33839,Health,Health
33840,Politics,Politics


Set up access to the OpenAI API

In [8]:
with open('WL_API.txt') as f:       # Load the private key, which can be generated on OpenAI's website
    openai.api_key = f.read().strip()   # Strip any stray whitespace around the key

openai.Model.list()         # This raises an error if the key is incorrect

<OpenAIObject list at 0x7f9fb1b6a530> JSON: {
  "data": [
    {
      "created": 1649358449,
      "id": "babbage",
      "object": "model",
      "owned_by": "openai",
      "parent": null,
      "permission": [
        {
          "allow_create_engine": false,
          "allow_fine_tuning": false,
          "allow_logprobs": true,
          "allow_sampling": true,
          "allow_search_indices": false,
          "allow_view": true,
          "created": 1669085501,
          "group": null,
          "id": "modelperm-49FUp5v084tBB49tC4z8LPH5",
          "is_blocking": false,
          "object": "model_permission",
          "organization": "*"
        }
      ],
      "root": "babbage"
    },
    {
      "created": 1649359874,
      "id": "davinci",
      "object": "model",
      "owned_by": "openai",
      "parent": null,
      "permission": [
        {
          "allow_create_engine": false,
          "allow_fine_tuning": false,
          "allow_logprobs": true,
          "allow_sa

### Predict categories from dirty keywordStrings
#### Zero-shot learning

In [82]:
from openai.embeddings_utils import cosine_similarity

def get_embedding(text, model="text-embedding-ada-002"):
    text = text.replace("\n", " ")
    return openai.Embedding.create(input = text, model=model)['data'][0]['embedding']

labels = categories[categories['is_primary']=='yes']['first_level'].tolist()
label_embeddings = [get_embedding(label, model = "text-embedding-ada-002") for label in labels]
 
def label_score(review_text, label_embeddings):
    # Embed the text once, then compare it against every label embedding
    review_embedding = get_embedding(review_text, model='text-embedding-ada-002')

    return [cosine_similarity(review_embedding, label_embedding)
            for label_embedding in label_embeddings]

In [83]:
def label_categories_emb(keywords):
    label_scores = label_score(" ".join(keywords), label_embeddings)
                               
    max_index = label_scores.index(max(label_scores))
    label = labels[max_index]
    return label
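To make the scoring step concrete without calling the API, here is a small self-contained sketch of cosine similarity followed by an arg-max over labels. The three-dimensional vectors are made up for illustration (real embeddings are much longer), and `cosine` and `nearest_label` are hypothetical names, not functions from the OpenAI library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_label(text_vec, label_vecs, label_names):
    """Return the label whose vector is most similar to text_vec, with its score."""
    scores = [cosine(text_vec, vec) for vec in label_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return label_names[best], scores[best]


# Toy vectors, invented for this example only.
toy_labels = ["Science", "Politics"]
toy_label_vecs = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.1]]
toy_text_vec = [0.9, 0.1, 0.3]

best_label, best_score = nearest_label(toy_text_vec, toy_label_vecs, toy_labels)
```

In this toy setup the text vector points almost in the same direction as the "Science" vector, so that label wins. Returning the score alongside the label also makes it possible to flag low-confidence assignments later.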

Assign categories using the zero-shot embedding method.
The free tier allows only 60 requests per minute, so we work with a smaller sample of 60 articles.
These 60 requests take 13 seconds.
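For samples larger than 60 articles, the requests would need to be spaced out to stay under the limit. One possible approach is sketched below, assuming the limit is a simple 60 calls per 60 seconds. The `RateLimiter` class is hypothetical and not part of the OpenAI library; the clock and sleep functions are injectable so the behaviour can be checked without waiting.

```python
import time


class RateLimiter:
    """Space calls evenly so that at most max_calls happen per period (seconds)."""

    def __init__(self, max_calls=60, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = period / max_calls
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time at which the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval
```

Calling `limiter.wait()` immediately before each API request keeps the pace at one request per second with the default settings. This is slower than the burst of 60 requests in 13 seconds, but it does not stall once the quota is used up.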

In [91]:
df_tiny = df_small.head(60).copy()   # Copy so the new column is not written to a slice of df_small
df_tiny['categoriesGPTembedding'] = df_tiny['keywordStrings'].apply(label_categories_emb)


In [95]:
print('Agreement between GPT embedding and DW category:')
sum(df_tiny['categoriesGPTembedding'] == df_tiny['cleanCategoryStrings'])/60

0.36666666666666664

On this sample of 60 articles, the embedding approach matches the DW category about 37% of the time.
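With only 60 articles, this figure carries noticeable uncertainty. As a rough sketch, assuming the articles can be treated as independent draws and using the normal approximation to the binomial, the standard error can be estimated as follows. The count of 22 matches is the one implied by 0.3667 × 60, and `agreement_with_se` is a hypothetical helper.

```python
import math


def agreement_with_se(n_match, n_total):
    """Return the share of matches and its approximate binomial standard error."""
    p = n_match / n_total
    se = math.sqrt(p * (1 - p) / n_total)
    return p, se


# 22 matches out of 60 articles, as implied by the agreement printed above.
p, se = agreement_with_se(22, 60)
```

This gives a standard error of roughly 0.06, so differences of a few percentage points between methods on this sample should be read with caution.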

#### Prompt engineering

In [98]:
prompt1 = "I will give you a list of keywords. Assign them to one of the following categories: "
prompt2 = ". Answer with one word - the category that best matches given keywords. Keywords: "
category_list = ", ".join(labels)   # Separate name so the categories DataFrame is not overwritten

prompt1 + category_list + prompt2 +  ", ".join(df_tiny['keywordStrings'][0])

'I will give you a list of keywords. Assign them to one of the following categories: Cars and Transportation, Education, Learning German, Digital World, History, Society, Health, Religion, Catastrophe, Culture, Lifestyle, Media, Migration, Nature and Environment, Climate, Offbeat, Politics, Law and Justice, Human Rights, Travel, Sports, Technology, Innovation, Business, Science. Answer with one word - the category that best matches given keywords. Keywords: NASA, OSIRIS-REx, Bennu, asteroid'

In [109]:
def label_categories_chat(keywords):
    # Ask the completion model for the best-matching category of one keyword list
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt1 + category_list + prompt2 + ", ".join(keywords),
        max_tokens=300,
        temperature=0)   # Temperature 0 keeps the answers as repeatable as possible
    return response['choices'][0]['text'].strip('\n')
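Because `label_categories_chat` sends one request per article, a single transient failure would abort a whole run. A minimal retry wrapper with exponential backoff might look like the sketch below. `call_with_retry` is a hypothetical helper; it catches any `Exception` for simplicity, whereas real code would narrow this to the client's rate-limit or connection errors.

```python
import time


def call_with_retry(func, *args, max_attempts=4, base_delay=1.0,
                    sleep=time.sleep, **kwargs):
    """Call func, retrying on failure with delays of base_delay * 2**attempt."""
    for attempt in range(max_attempts):
        try:
            return func(*args, **kwargs)
        except Exception:  # assumption: narrow this to transient errors in practice
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            sleep(base_delay * 2 ** attempt)
```

It could be used as `call_with_retry(label_categories_chat, keywords)` inside the function passed to `apply`.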

The following cell takes 26 s to run for 60 articles with max_tokens set to 300.

In [110]:
df_tiny['categoriesGPTchat'] = df_tiny['keywordStrings'].apply(func = lambda x: label_categories_chat(x))



In [115]:
df_tiny[['categoriesGPTchat', 'cleanCategoryStrings']]

Unnamed: 0,categoriesGPTchat,cleanCategoryStrings
0,Science,Science
1,Migration,Law and Justice
2,Politics,Politics
3,Culture,Law and Justice
4,Politics,Politics
5,Politics,Politics
6,Politics,Politics
7,Law and Justice,Business
8,Human Rights,Culture
9,Nature and Environment,Catastrophe
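The comparison with the DW categories relies on exact string matches, so a completion with different casing, a trailing period, or an unknown category would count as a miss. A small normalization step can make this explicit. The sketch below uses a hypothetical `normalize_label` helper that maps a raw completion onto the known labels and returns `None` when nothing matches.

```python
def normalize_label(raw, labels):
    """Map a raw completion to a known label, or None if nothing matches."""
    cleaned = raw.strip().strip(".").casefold()
    for label in labels:
        if label.casefold() == cleaned:
            return label
    return None
```

Rows that come back as `None` point to completions outside the category list, which are worth inspecting separately from ordinary disagreements.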


In [120]:
print('Agreement between GPT embedding and GPT prompt:')
print(sum(df_tiny['categoriesGPTchat'] == df_tiny['categoriesGPTembedding'])/60)

print('Agreement between GPT prompt and DW category:')
print(sum(df_tiny['categoriesGPTchat'] == df_tiny['cleanCategoryStrings'])/60)

Agreement between GPT embedding and GPT prompt:
0.38333333333333336
Agreement between GPT prompt and DW category:
0.5833333333333334
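The agreement figures above divide by a hard-coded 60. A small helper that derives the denominator from the data is sketched below; `agreement` is a hypothetical name, and the example lists are the ten rows of the comparison table displayed earlier, not the full sample.

```python
def agreement(predicted, reference):
    """Share of positions at which the two label sequences are equal."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    if not predicted:
        raise ValueError("sequences must not be empty")
    return sum(p == r for p, r in zip(predicted, reference)) / len(predicted)


# The ten rows shown in the comparison table (prompt result vs. DW category).
gpt_prompt = ["Science", "Migration", "Politics", "Culture", "Politics",
              "Politics", "Politics", "Law and Justice", "Human Rights",
              "Nature and Environment"]
dw_category = ["Science", "Law and Justice", "Politics", "Law and Justice",
               "Politics", "Politics", "Politics", "Business", "Culture",
               "Catastrophe"]

share = agreement(gpt_prompt, dw_category)
```

On these ten rows the two columns agree in five cases, so `share` is 0.5. The full 60-article figure reported above is 0.58.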


### Few-shot learning with examples of categories in the prompt