# Finding Conan
We want to use ChatGPT-4 to retrieve information from our scraped data: the Berlin police reports (Polizeimeldungen) at https://www.berlin.de/polizei/polizeimeldungen/

We want the following information:
1. When did the crime happen?
2. What crime happened?
3. Where did it happen?


In [1]:
# Install in terminal (Python package; `npm install openai` is only needed for Node.js)
# pip install openai
# pip install --upgrade openai

# **Import the necessary packages**

In [2]:
import os
import openai
import pandas as pd
import requests
import json
import shutil 
import time
import glob
import numpy as np
from datetime import datetime

# **Set Parameters**

In [3]:
#Input
current_batch_name = 'second-batch-2467-files-to-20220516'
base_path = os.path.join(os.path.dirname(os.path.dirname(os.getcwd())), 'raw-data', 'unstructured-data')
batch_name = os.path.join(current_batch_name, 'categorized')
cat_type = 'one-case'
batch_num = 2

#Output
output_dir_path = os.path.join(os.path.dirname(os.path.dirname(os.getcwd())), 'raw-data', 'structured-data', 'csv-output')


# **Set API_Key secret**

In [4]:
# Get the path where the credential is
path_cred = os.path.join(os.path.dirname(os.path.dirname(os.getcwd())), 'credentials')

# open the credential
with open(os.path.join(path_cred,'secrets.json')) as secrets_file:
    secrets = json.load(secrets_file)
    api_key = secrets['ir_api_key']

In [5]:
path_cred

'/Users/ellenlee/code/hclpush/finding-conan/credentials'
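
The key lookup above can be wrapped in a small helper so that a missing file or key fails with a clear message. This is a minimal sketch, assuming the same `secrets.json` layout with an `ir_api_key` entry. The function name `load_api_key` and the `OPENAI_API_KEY` environment-variable fallback are illustrative assumptions, not part of the original notebook.

```python
import json
import os


def load_api_key(cred_dir, key_name="ir_api_key", env_var="OPENAI_API_KEY"):
    """Return the API key from <cred_dir>/secrets.json, else from the environment.

    `load_api_key` and the environment-variable fallback are hypothetical additions.
    """
    secrets_path = os.path.join(cred_dir, "secrets.json")
    if os.path.exists(secrets_path):
        with open(secrets_path, encoding="utf-8") as secrets_file:
            secrets = json.load(secrets_file)
        if key_name in secrets:
            return secrets[key_name]
    # Fall back to the environment when the file or the entry is missing
    api_key = os.environ.get(env_var)
    if api_key is None:
        raise KeyError(f"No '{key_name}' in {secrets_path} and {env_var} is not set")
    return api_key
```

In this notebook it could be called as `api_key = load_api_key(path_cred)`.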

# Load a single JSON file

In [6]:
file_name = 'pressemitteilung.1085404.json'
f = open(os.path.join(base_path, batch_name, cat_type, file_name))
case = json.load(f)

del case['subtitle']

In [7]:
case

{'title': 'Zwei Verletzte nach Angriff mit Messer - Festnahme',
 'place': 'Pankow',
 'date': '16.05.2021',
 'data': [{'number': '1071',
   'description': 'In der vergangenen Nacht wurden in Prenzlauer Berg zwei Männer mit einem Messer verletzt. Ersten Erkenntnissen zufolge hielten sich die beiden 18-Jährige gegen 23.40 Uhr an einem Hauseingang in der Danziger Straße auf, als ein Unbekannter sie angesprochen und nach der Uhrzeit gefragt haben soll. Im weiteren Verlauf soll der junge Mann das Duo mit einem Messer bedroht und Geld gefordert haben. Als die beiden Bedrohten den Ort verlassen wollten, soll der Unbekannte den einen der beiden mit dem Messer am Arm verletzt und seinen Begleiter Pfefferspray ins Gesicht gesprüht haben. Anschließend flüchtete der Angreifer über die Schönhauser Allee in Richtung Torstraße.\n    Alarmierte Rettungskräfte brachten den 18-Jährigen mit einer Stichverletzung am Oberarm zur stationären Behandlung in ein Krankenhaus. Seine Begleitperson erlitt eine Auge
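
Later cells append this dict to the prompt with `str(...)`, so the model sees a Python dict literal, while the example inside the prompt is plain text (title, date line, district, report number, description). A minimal sketch of a formatter that produces the plain-text layout, assuming only the keys visible in the output above; `case_to_text` is a hypothetical helper, and the example description is shortened to its first sentence.

```python
def case_to_text(case):
    """Flatten one scraped case dict into the layout of the prompt's example.

    Assumes the keys seen in the notebook output: 'title', 'date', 'place' and
    a 'data' list whose items hold 'number' and 'description'.
    """
    lines = [
        case.get("title", ""),
        f"Polizeimeldung vom {case.get('date', '')}",
        case.get("place", ""),
    ]
    for entry in case.get("data", []):
        lines.append(f"Nr. {entry.get('number', '')}")
        # Collapse the scraper's line breaks and indentation into single spaces
        lines.append(" ".join(entry.get("description", "").split()))
    return "\n".join(lines)


example_case = {
    "title": "Zwei Verletzte nach Angriff mit Messer - Festnahme",
    "place": "Pankow",
    "date": "16.05.2021",
    "data": [{
        "number": "1071",
        "description": "In der vergangenen Nacht wurden in Prenzlauer Berg "
                       "zwei Männer mit einem Messer verletzt.",
    }],
}
print(case_to_text(example_case))
```

Whether this layout improves the labels compared with the dict literal is untested here; it only keeps the input consistent with the one-shot example.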

# Create Important Functions

In [8]:
# Current prompt
prompt = """
You are a crime-analyzing assistant. The assistant is helpful, clever, and gives accurate answers.

You define types of crimes according to following categories:
Homicide = The act of unlawfully causing the death of another person.
Hate Crime - Disability = Anti-Mental Disability, Anti-Physical Disability
Hate Crime - Gender = Anti-Male, Anti-Female
Hate Crime - Gender Identity = Anti-Transgender, Anti-Gender Non-Conforming
Hate Crime - Religious = religious hate crimes target a victim based on  their theological faith, or lack of faith, for example, Anti-Jewish, Anti-Christian or Anti-Muslim or against
Hate Crime - Sexual orientation = Anti-Bisexual, Anti-Gay (Male), Anti-Heterosexual, Anti-Lesbian, Anti-Lesbian, Gay, Bisexual, or Transgender (Mixed Group)
Hate Crime - Racial/Ethnicity = Anti-American Indian or Alaska Native, Anti-Arab, Anti-Asian, Anti-Black or African American, Anti-Hispanic or Latino, Anti-Multiple Races, Group, Anti-Native Hawaiian or Other Pacific Islander, Anti-Other Race/Ethnicity/Ancestry, Anti-White
Property Crimes = Offenses that involve unlawful interference with someone else's property, such as theft, burglary, or vandalism.
Verbal Abuse = Use of offensive, derogatory, or threatening language with the intention to harm or intimidate another person verbally.
Property Damage = Damage or destruction of real or tangible personal property, 
Drug Offenses = Violations of laws related to the possession, distribution, manufacturing, or use of illegal substances or controlled substances without proper authorization.
General Assault = Physical or verbal attacks on another person that result in causing harm or fear.
Sexual Assault = Non-consensual sexual acts or behavior that involve physical or psychological harm or coercion
Sexual Harassment = Unwanted sexual advances, comments, or conduct that creates a hostile or uncomfortable environment for the victim
Unclassified = Cases or incidents that do not fit into any specific predefined category.

Read the following text of a crime that happened in Berlin, Germany. Give information about the crime in the form of key-value pairs. List the type of crime described in the text according to your definition. If multiple crimes occurred, add all types of crimes that are directly described in the text. List the location as accurately as possible. The terms "straße", "weg" or "Kreuzung" alone or as part of a word indicate an accurate location.
Also list the year, date, and time of the crime. List the sex of the victims and the sex of the offender. List how many victims and how many offenders were involved. If there is no clear answer, respond with "unknown".

Here is one example for the following crime and the desired output:

Junge Männer mit Messer verletzt
Polizeimeldung vom 13.05.2023
Mitte
Nr. 0749
Gestern Abend zeigten ein Jugendlicher und zwei junge Männer in der Alex-Wache in Mitte Körperverletzungen, begangen aus einer Gruppe, an. Die beiden 20-Jährigen und ihr 17 Jahre alter Begleiter gaben an, dass sie gegen 21.30 Uhr auf dem Alexanderplatz in Richtung S-Bahnhof unterwegs waren, als sie plötzlich aus einer Gruppe von sechs bis acht Männern heraus zunächst angepöbelt und dann körperlich attackiert wurden. Dabei erlitt einer der 20-Jährigen eine stark blutende Schnittverletzung im Gesicht und sein gleichaltriger Bekannter eine oberflächliche Stichverletzung am Rücken. Der am Rücken Verletzte wurde ambulant behandelt, der im Gesicht Verletzte stationär aufgenommen. Einsatzkräfte stellten kurz nach der Anzeigenerstattung in Höhe der Sankt Marienkirche zwei 19-jährige Männer fest, die nach Angaben des Trios zu der angreifenden Gruppe gehörten. Die Kräfte nahmen die beiden Tatverdächtigen fest und führten sie einer erkennungsdienstlichen Behandlung zu. 

Year: 2023
Date: 13.05
Time:  21.30 Uhr 
Type of Crime: General Assault, Verbal Abuse
Location: Alexanderplatz 
Victim's Sex: Male
Offender's Sex: Male
Number of Victims: 3
Number of Offenders: 6-8

Please do this for the following crime:
"""

In [9]:
def chatGPT(prompt, text, api_key):
    url = "https://api.openai.com/v1/completions"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    prompt = { 
        "model":"text-davinci-003",
        "prompt": prompt + str(text),
        "temperature":0.22,
        "max_tokens":1788,
        "top_p":0.46,
        "frequency_penalty":0,
        "presence_penalty":0
    }
    response = requests.post(url, headers=headers, json=prompt)
    json_response = response.json()
    print(f'======== Response.json(): {json_response} ========')
    # Failed requests come back as {'error': {...}} without a 'choices' key
    if "error" in json_response:
        raise RuntimeError(f"OpenAI API error: {json_response['error']}")
    text_output = json_response['choices'][0]['text']

    return text_output

In [10]:
def handle_response(json_response, url, headers, payload):
    """
    Returns the completion text, retrying the request up to 5 times on API errors.
    `url`, `headers` and `payload` are the values used for the original request.
    """
    num_retry = 0
    num_retry_limit = 5

    while "error" in json_response:
        if num_retry >= num_retry_limit:
            raise RuntimeError(f"Request still failing after {num_retry_limit} retries: {json_response['error']}")
        num_retry += 1
        print(f"!!!!======== Server error occurred. Retry number {num_retry} in 5 seconds... =======!!!!")
        time.sleep(5)  # Wait for 5 seconds
        json_response = requests.post(url, headers=headers, json=payload).json()

    if num_retry:
        print(f"======== Retry number {num_retry} succeeded =======")
    return json_response['choices'][0]['text']
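
A retry loop like the one in `handle_response` can be separated from the HTTP call, which makes it testable without network access. A minimal sketch, assuming the response shape used in this notebook (`{"error": ...}` on failure, `choices[0].text` on success). The name `call_with_retries`, the injected `send_request` callable, and the doubling delay are illustrative assumptions, not the notebook's method.

```python
import time


def call_with_retries(send_request, max_retries=5, base_delay=5.0, sleep=time.sleep):
    """Call send_request() until its JSON has no "error" key or retries run out.

    `send_request` is any zero-argument callable returning the decoded JSON
    response (hypothetical helper; the notebook posts with `requests.post`).
    """
    json_response = send_request()
    for attempt in range(max_retries):
        if "error" not in json_response:
            return json_response["choices"][0]["text"]
        # Wait longer after each failure: base_delay, then 2x, 4x, ...
        sleep(base_delay * 2 ** attempt)
        json_response = send_request()
    if "error" in json_response:
        raise RuntimeError(
            f"Giving up after {max_retries} retries: {json_response['error']}"
        )
    return json_response["choices"][0]["text"]
```

With the notebook's variables, the callable could be `lambda: requests.post(url, headers=headers, json=payload).json()`. Raising after the last retry avoids returning an undefined value when every attempt fails.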

In [11]:
def label_cases_with_gpt(num_case, base_path, batch_name, files_gen, output_dict):
    """
    Calls the OpenAI API to label the raw scraped JSON files on criminal cases
    Outputs a dictionary with keys being the case id taken from the file name
        and values being the raw GPT output for that case
    """
    for index, file_path in enumerate(files_gen[:num_case]): # DEBUG
        # Load the scraped case and close the file again
        with open(file_path) as data:
            case = json.load(data)
        case.pop('subtitle', None) # This does not include relevant info
        
        file_name = os.path.basename(file_path)
        case_id = str(file_name[-12:-5])
        print(f'======== Working on labeling case number {index+1} with GPT: {case_id} ========')
        
        try: 
            # Get labeling from GPT API
            raw_gpt_output = chatGPT(prompt, case, api_key)
            output_dict[case_id] = raw_gpt_output

            # Move file to folder
            destination_dir_path = os.path.join(base_path, batch_name, 'info-extracted-cases')
            os.makedirs(destination_dir_path, exist_ok=True)
            shutil.move(file_path, os.path.join(destination_dir_path, file_name))
        except Exception as error:
            print(f'!!!!!======== Error occurred while labeling case number {index+1} with case id {case_id}: {error} ========!!!!!')
            #{'error': {'message': 'The server had an error while processing your request. Sorry about that!', 'type': 'server_error', 'param': None, 'code': None}}
            break
        
    return output_dict
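
The slice `file_name[-12:-5]` in `label_cases_with_gpt` only yields the right id when the number has exactly seven digits. A minimal sketch of a pattern-based alternative, assuming file names of the form `pressemitteilung.<id>.json` as in the cells above; `extract_case_id` and `CASE_FILE_PATTERN` are hypothetical names.

```python
import os
import re

# Matches names such as "pressemitteilung.1085404.json" and captures the digits
CASE_FILE_PATTERN = re.compile(r"^pressemitteilung\.(\d+)\.json$")


def extract_case_id(file_path):
    """Return the numeric id from 'pressemitteilung.<id>.json', or None if the name differs."""
    match = CASE_FILE_PATTERN.match(os.path.basename(file_path))
    return match.group(1) if match else None


print(extract_case_id("one-case/pressemitteilung.1085404.json"))  # 1085404
```

Returning `None` for unexpected names lets the caller skip or log such files instead of storing a wrong id.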

In [12]:
def inject_cases_info_to_dict(base_path, batch_name, input_dict):
    """
    Parses the raw GPT output of each labeled case into a dictionary of columns,
    which can then be turned into a DataFrame
    """
    output_dict = {
        'unique_case_id':[],
        'official_case_id':[],
        'Type of Crime':[],
        'Location':[],
        'Year':[],
        'Date':[],  # Modified
        'Time':[],
        "Victim's Sex":[],
        "Offender's Sex":[],
        'Number of Victims':[],
        'Number of Offenders':[]
}   
    for index, (case_id, case_info) in enumerate(input_dict.items()):
        print(f'======== Extracting info from labeled case number {index+1}: {case_id} ========')
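        try:
            # Parse the "Key: value" lines of the GPT output
            gpt_output_lst = [ele.strip() for ele in case_info.strip().split('\n') if ele.strip()]
            parsed = {}
            for element in gpt_output_lst:
                # Split on the first colon only, so values such as "21:30 Uhr" stay intact
                key, value = element.split(":", 1)
                parsed[key.strip()] = value.strip()

            output_dict['official_case_id'].append(case_id.zfill(7))
            output_dict['unique_case_id'].append(str(index+1).zfill(6) + '_' + case_id)
            # Fill missing fields with "unknown" so that all columns keep the same length
            for key in output_dict:
                if key not in ('official_case_id', 'unique_case_id'):
                    output_dict[key].append(parsed.get(key, 'unknown'))

        except Exception:
            print(f'!!======== Error occurred while processing case id {case_id} ========!!')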
            # Move file to folder
            file_name = f'pressemitteilung.{case_id}.json'
            current_file_path = os.path.join(base_path, batch_name, 'info-extracted-cases', file_name)
            destination_dir_path = os.path.join(base_path, batch_name, 'unsuccessful-cases')
            
            if not os.path.exists(destination_dir_path):
                os.makedirs(destination_dir_path)
            
#             shutil.move(current_file_path, os.path.join(destination_dir_path, file_name))
            break
    
    output_dict_renamed = {key.lower().replace(' ', '_').replace("'s", ""): value 
                               for key, value in output_dict.items()}
    
    return output_dict_renamed
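
The parsing step can be isolated into a function that always returns one value per expected field, which keeps the columns aligned when the model omits a line. A minimal sketch, assuming the `Key: value` layout requested in the prompt; `parse_gpt_output` and `EXPECTED_KEYS` are hypothetical names, and the sample text is taken from the example in the prompt.

```python
EXPECTED_KEYS = [
    "Year", "Date", "Time", "Type of Crime", "Location",
    "Victim's Sex", "Offender's Sex", "Number of Victims", "Number of Offenders",
]


def parse_gpt_output(raw_text, expected_keys=EXPECTED_KEYS, default="unknown"):
    """Turn 'Key: value' lines into a dict with one entry per expected key."""
    parsed = {}
    for line in raw_text.splitlines():
        if ":" not in line:
            continue  # skip blank lines and stray text
        # Split on the first colon only so values like "21:30 Uhr" stay whole
        key, value = line.split(":", 1)
        parsed[key.strip()] = value.strip()
    return {key: parsed.get(key, default) for key in expected_keys}


sample = """
Year: 2023
Date: 13.05
Time: 21.30 Uhr
Type of Crime: General Assault, Verbal Abuse
Location: Alexanderplatz
"""
print(parse_gpt_output(sample))
```

Fields missing from the sample, such as `Victim's Sex`, come back as `"unknown"`, matching the fallback value the prompt asks the model to use.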

In [13]:
def output_csv(batch_num, output_dir_path, df):
    os.makedirs(output_dir_path, exist_ok=True)
    
    cur_datetime = datetime.now().strftime("%Y-%m-%d_%H-%M")
    csv_file_name = f'batch{batch_num}_labeled_{df.shape[0]}_cases_promptv2_{cur_datetime}.csv'
    df.to_csv(os.path.join(output_dir_path, csv_file_name), sep=',', index=False)
    print(f'file save with name: {csv_file_name}')

In [14]:
def main_process(base_path, batch_name, cat_type,  num_case=5, export_csv=False):
    # Input files
    files_gen = sorted(
            glob.glob(
            os.path.join(base_path, batch_name, cat_type, 'pressemitteilung*')),
            reverse=True)
    
    
    # Main processing
    start_time = time.time()  # Measure run time of GPT labeling
    original_dict = {}
    original_dict_filled = label_cases_with_gpt(num_case, base_path, batch_name, files_gen, original_dict)
    end_time = time.time() # Measure run time of GPT labeling
    runtime_minutes = (end_time - start_time) / 60
    print(f"======= GPT labeling runtime: {runtime_minutes:.2f} minutes======")
    info_dict = inject_cases_info_to_dict(base_path, batch_name, original_dict_filled)
    
    df = pd.DataFrame.from_dict(info_dict)

    if export_csv:
        # batch_num and output_dir_path come from the "Set Parameters" cell
        output_csv(batch_num, output_dir_path, df)
        
    return df

# Start looping

### Functions separated

In [45]:
# Get the current files in the specified folder
files_gen = sorted(
        glob.glob(
        os.path.join(base_path, batch_name, cat_type, 'pressemitteilung*')),
        reverse=True)

In [46]:
original_dict = {}
original_dict_filled = label_cases_with_gpt(200, base_path, batch_name, files_gen, original_dict)



In [42]:
info_dict = inject_cases_info_to_dict(base_path, batch_name, original_dict_filled)



In [43]:
df_final_sep = pd.DataFrame.from_dict(info_dict)
df_final_sep.tail()

Unnamed: 0,unique_case_id,official_case_id,type_of_crime,location,year,date,time,victim_sex,offender_sex,number_of_victims,number_of_offenders
195,000196_1125328,1125328,"Verbal Abuse, Hate Crime - Religious",Sprengelstraße,2021,12.09,13 Uhr,Male,Male,1,1
196,000197_1125327,1125327,"General Assault, Verbal Abuse, Hate Crime - Ge...",U-Bahnhof Eberswalder Straße,2021,12.09,Mitternacht,Male,Male,1,5-6
197,000198_1125326,1125326,"General Assault, Verbal Abuse, Hate Crime - Ge...","Waterloo-Ufer, Mehringplatz",2021,12.09,14.30 Uhr,Female,Male,1,1
198,000199_1125324,1125324,"Property Damage, Arson","Jägerstraße, Treptow-Köpenick",2021,12.09,3 Uhr,Unknown,Unknown,Unknown,Unknown
199,000200_1125323,1125323,"Property Damage, Arson","Meyerbeerstraße, Weißensee",2021,12.09,23.45 Uhr,Unknown,Unknown,8,Unknown


In [44]:
output_csv(2, output_dir_path, df_final_sep)

file save with name: batch2_labeled_200_cases_promptv2_2023-06-07_22-56.csv


#### Diagnosis

In [None]:
original_dict_filled

In [None]:
info_dict

In [None]:
# TODO:
# One-hot encoding
# Timestamp
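
Both TODO items can be sketched with the standard library alone. This is a minimal sketch, assuming the column formats seen in the output above (`year` like `2021`, `date` like `12.09` meaning day.month, `time` like `14.30 Uhr`, and comma-separated `type_of_crime`). The function names `to_timestamp` and `one_hot_crime_types` are hypothetical.

```python
import re
from datetime import datetime


def to_timestamp(year, date, time_text):
    """Combine e.g. ('2021', '12.09', '14.30 Uhr') into a datetime, else None.

    Only numeric times such as '13 Uhr' or '23.45 Uhr' are handled; words like
    'Mitternacht' or 'unknown' yield None.
    """
    date_match = re.fullmatch(r"\s*(\d{1,2})\.(\d{1,2})\.?\s*", date)
    time_match = re.fullmatch(r"\s*(\d{1,2})(?:[.:](\d{2}))?\s*(?:Uhr)?\s*", time_text)
    if not (year.strip().isdigit() and date_match and time_match):
        return None
    day, month = int(date_match.group(1)), int(date_match.group(2))
    hour, minute = int(time_match.group(1)), int(time_match.group(2) or 0)
    try:
        return datetime(int(year), month, day, hour, minute)
    except ValueError:  # e.g. hour 25 or day 31 in a 30-day month
        return None


def one_hot_crime_types(type_of_crime_values):
    """Return (sorted labels, rows of 0/1) for comma-separated multi-label strings."""
    split_rows = [
        {label.strip() for label in value.split(",") if label.strip()}
        for value in type_of_crime_values
    ]
    labels = sorted(set().union(*split_rows))
    return labels, [[int(label in row) for label in labels] for row in split_rows]
```

Because one case can carry several crime types, each row may contain more than one `1`. Non-numeric times are left as `None` so they can be reviewed by hand instead of being guessed.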

In [None]:
#Next Steps 
# Lena todo: make sure prompt works in function for one case - (all cases)
# Hsin todo: loop through cases - output dataframe

### One function for all

In [None]:
df_final = main_process(base_path, batch_name, cat_type, 20)

In [None]:
df_final

In [None]:
# Save to CSV
output_csv(batch_num, output_dir_path, df_final)

## Backup