# Data Mining using ChatGPT

So far, we have gathered the information needed to start the analysis: the text messages and the text in the photos have been extracted and joined into a single string. <br>

We now have a `full_text` column that contains both, but how can we extract structured information from that text?

Looking at the data, we couldn't find any pattern or structure to rely on. I tried using regular expressions to extract the job title, for example, but the results weren't accurate, mainly because of the nature of the data. <br>
Next, I tried spaCy to extract named entities and other information, but, as with the previous method, the results were messy and inaccurate. <br>
A more advanced option that came to mind was a pretrained deep learning model, such as BERT or GPT-2, to extract entities and other information from each job post.
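
To make the limitation of the regular-expression attempt concrete, here is a minimal sketch of what such an extractor could look like. It is not the code originally tried: the pattern, the `extract_job_title` helper, and the hashtag clean-up are assumptions based on the sample rows, where some posts start with a `Job Title:` label and others (a bare link, for instance) have no label at all.

```python
import re

# Hypothetical pattern: a "Job Title:" label at the start of a line,
# followed by the title on the same line. Case-insensitive because the
# sample rows use both "Job Title:" and "Job title:".
JOB_TITLE_RE = re.compile(
    r"^[ \t]*job[ \t]*title[ \t]*:[ \t]*(.+)$",
    re.IGNORECASE | re.MULTILINE,
)

def extract_job_title(full_text):
    """Return the text after a 'Job Title:' label, or None if no label exists."""
    match = JOB_TITLE_RE.search(full_text)
    if match is None:
        return None
    # Drop hashtag markers and turn underscores back into spaces.
    return match.group(1).replace("#", "").replace("_", " ").strip()

print(extract_job_title("Job Title:#cashier\nJob Type: #full_time"))
print(extract_job_title("We are hiring a developer, send your CV"))
```

A pattern like this only works for posts that follow the labelled layout. Any post that states the title in free text, or only inside the image, returns `None`.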

So I decided to give ChatGPT a shot. I logged into the ChatGPT playground and asked it to extract the needed information from a job post. After multiple attempts, I found a question whose responses are easy to extract information from.

After using ChatGPT, I noticed the following:
- ChatGPT always generates an answer, even when the input data is insufficient.
- It can change the order of the fields in the response, even when I specify the order.
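
Because the field order is not stable, one way to read a response is to key on the field labels rather than on line position. The sketch below is an assumption about how that could be done, not part of the original notebook: `parse_response` and the `FIELDS` list are hypothetical, and the labels are taken from the prompt wording, so they may need adjusting to match what the model actually returns.

```python
# Field labels requested in the prompt (assumed to appear as "Label: value").
FIELDS = [
    "job title", "company name", "salary", "location",
    "job type", "years of experience", "required skills",
]

def parse_response(response):
    """Map each expected field to its value, or None when absent or 'None'."""
    parsed = dict.fromkeys(FIELDS)
    for line in response.splitlines():
        label, sep, value = line.partition(":")
        key = label.strip().lower()  # labels are not capitalised consistently
        value = value.strip()
        if sep and key in parsed and value.lower() not in ("", "none"):
            parsed[key] = value
    return parsed

sample = "\nCompany Name: None\nSalary: None\nJob title: .NET Developer\n"
print(parse_response(sample))
```

Treating the literal answer `None` as a missing value also gives a simple way to flag posts where the model had nothing to extract.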

In [1]:
import openai
import os
import pandas as pd
import numpy as np
from dotenv import load_dotenv

In [2]:
# loading API key to use ChatGPT in code
load_dotenv()
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

In [3]:
data = pd.read_csv("../data/data_v2.csv")
data.head()


Unnamed: 0,id,date,photo,width,height,text,text_entities,raw_text,type,ocr_res,full_text
0,1567,2022-01-02T13:46:28,photos/photo_1205@02-01-2022_13-46-28.jpg,800.0,419.0,"['Job Title:', {'type': 'hashtag', 'text': '#s...","[{'type': 'plain', 'text': 'Job Title:'}, {'ty...",Job Title:#senior and a junior #developer\n \n...,job_post,We are on the hunt for a JUNIOR/SENIOR Develop...,Job Title:#senior and a junior #developer\n \n...
1,1568,2022-01-03T11:09:36,photos/photo_1206@03-01-2022_11-09-36.jpg,1110.0,1124.0,"['Job Title:', {'type': 'hashtag', 'text': '#c...","[{'type': 'plain', 'text': 'Job Title:'}, {'ty...",Job Title:#cashier\nJob Type: #full_time\n \nش...,job_post,Jolgi Job Type:Full Time Send your CV to m.jaw...,Job Title:#cashier\nJob Type: #full_time\n \nش...
2,1569,2022-01-03T14:28:11,photos/photo_1207@03-01-2022_14-28-11.jpg,1280.0,1267.0,"['Company: ', {'type': 'hashtag', 'text': '#Na...","[{'type': 'plain', 'text': 'Company: '}, {'typ...",Company: #National_Technology_Group #NTG)\nJob...,job_post,JOIN OURTEAM We're Hiring. NET DEVELOPER +lyea...,Company: #National_Technology_Group #NTG)\nJob...
3,1570,2022-01-03T17:12:13,photos/photo_1208@03-01-2022_17-12-13.jpg,1014.0,1124.0,"['Job title: ', {'type': 'hashtag', 'text': '#...","[{'type': 'plain', 'text': 'Job title: '}, {'t...",Job title: #Employees for Operations Departmen...,job_post,Job Vacancy Employees for Operations Departmen...,Job title: #Employees for Operations Departmen...
4,1571,2022-01-03T19:16:11,photos/photo_1209@03-01-2022_19-16-11.jpg,1062.0,1125.0,"[{'type': 'link', 'text': 'https://www.faceboo...","[{'type': 'link', 'text': 'https://www.faceboo...",https://www.facebook.com/384708578676644/posts...,link,Jmloi AWASOL Developer Remotely Send your CV t...,https://www.facebook.com/384708578676644/posts...


In [4]:
openai.api_key = OPENAI_API_KEY

# Set up the model and prompt
model_engine = "text-davinci-003"
question = """
extract the following fields from the text: job title, company name, salary, location, job type, years of experience, and required skills as keywords, if no info were found set field to none:
"""
job_post = ''
prompt = ""

In [5]:
# Call the ChatGPT API on each job post's full text
responses = np.empty(data.full_text.shape, dtype='O')

for i, job_post in enumerate(data.full_text.values):
    # Skip missing or empty posts (pd.isna is more reliable than `is np.nan`)
    if pd.isna(job_post) or job_post == '':
        responses[i] = ''
        continue
    try:
        completion = openai.Completion.create(
            engine=model_engine,
            prompt=question + '\n' + job_post + '\n',
            max_tokens=500,
            n=1,
            stop=None,
            temperature=0,
        )
        responses[i] = completion.choices[0].text
    except Exception:
        # Mark failed calls so they can be found and retried later
        responses[i] = 'Exception'

    latest_idx = i  # to avoid starting from scratch if something goes wrong
    if i % 100 == 0:
        print(i)

0
100
200
300
400
500
600
700
800
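
The loop records `latest_idx` so that a failed run does not have to start over, but the resume step itself is not shown. A minimal sketch of one way to do it follows. It is an alternative to the loop above rather than the notebook's method: it stops at the first failure instead of writing `'Exception'`, and `fill_responses` and the `extract` callable (a stand-in for the API request) are hypothetical names.

```python
def fill_responses(texts, responses, extract, start=0):
    """Fill responses[i] for i >= start and return the index to resume from.

    `extract` is a hypothetical callable that wraps the API request for
    one text. Returns len(texts) when every item has been processed.
    """
    for i in range(start, len(texts)):
        try:
            responses[i] = extract(texts[i])
        except Exception:
            # Stop here; call again with start=i once the problem is fixed.
            return i
    return len(texts)

# Usage with a stand-in extractor instead of a real API call.
texts = ["post one", "post two"]
responses = [None] * len(texts)
next_idx = fill_responses(texts, responses, extract=str.upper)
print(next_idx, responses)
```

Keeping already-filled entries in `responses` means a second call only pays for the posts that were not processed the first time.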


In [5]:
data['description'] = responses

In [6]:
data.head()

Unnamed: 0,id,date,photo,width,height,text,text_entities,raw_text,type,ocr_res,full_text,description
0,1567,2022-01-02T13:46:28,photos/photo_1205@02-01-2022_13-46-28.jpg,800.0,419.0,"['Job Title:', {'type': 'hashtag', 'text': '#s...","[{'type': 'plain', 'text': 'Job Title:'}, {'ty...",Job Title:#senior and a junior #developer\n \n...,job_post,We are on the hunt for a JUNIOR/SENIOR Develop...,Job Title:#senior and a junior #developer\n \n...,\nJob Title: Junior/Senior Developer\nCompany ...
1,1568,2022-01-03T11:09:36,photos/photo_1206@03-01-2022_11-09-36.jpg,1110.0,1124.0,"['Job Title:', {'type': 'hashtag', 'text': '#c...","[{'type': 'plain', 'text': 'Job Title:'}, {'ty...",Job Title:#cashier\nJob Type: #full_time\n \nش...,job_post,Jolgi Job Type:Full Time Send your CV to m.jaw...,Job Title:#cashier\nJob Type: #full_time\n \nش...,\nJob Title: Cashier\nCompany Name: Euroline-R...
2,1569,2022-01-03T14:28:11,photos/photo_1207@03-01-2022_14-28-11.jpg,1280.0,1267.0,"['Company: ', {'type': 'hashtag', 'text': '#Na...","[{'type': 'plain', 'text': 'Company: '}, {'typ...",Company: #National_Technology_Group #NTG)\nJob...,job_post,JOIN OURTEAM We're Hiring. NET DEVELOPER +lyea...,Company: #National_Technology_Group #NTG)\nJob...,\nJob title: .NET Developer\nCompany Name: Nat...
3,1570,2022-01-03T17:12:13,photos/photo_1208@03-01-2022_17-12-13.jpg,1014.0,1124.0,"['Job title: ', {'type': 'hashtag', 'text': '#...","[{'type': 'plain', 'text': 'Job title: '}, {'t...",Job title: #Employees for Operations Departmen...,job_post,Job Vacancy Employees for Operations Departmen...,Job title: #Employees for Operations Departmen...,\nCompany Name: None\nSalary: None\nYears of E...
4,1571,2022-01-03T19:16:11,photos/photo_1209@03-01-2022_19-16-11.jpg,1062.0,1125.0,"[{'type': 'link', 'text': 'https://www.faceboo...","[{'type': 'link', 'text': 'https://www.faceboo...",https://www.facebook.com/384708578676644/posts...,link,Jmloi AWASOL Developer Remotely Send your CV t...,https://www.facebook.com/384708578676644/posts...,\nJob Title: AWASOL Developer \nCompany Name: ...


In [62]:
data.to_csv("../data/data_v3.csv",index=False)