Dataset: StackSample - 10% of Stack Overflow QA 

In [7]:
import openai
import os
import pandas as pd

In [3]:
# read the API key from the OPENAI_API_KEY environment variable
openai.api_key = os.getenv("OPENAI_API_KEY")

In [8]:
qa_df = pd.read_csv("python_qa.csv")
qa_df.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body,ParentId,Answer
0,11060,912.0,2008-08-14T13:59:21Z,,18,How should I unit test a code-generator?,This is a difficult and open-ended question I ...,11060,I started writing up a summary of my experienc...
1,17250,394.0,2008-08-20T00:16:40Z,,24,Create an encrypted ZIP file in Python,I'm creating an ZIP file with ZipFile in Pytho...,17250,I created a simple library to create a passwor...
2,31340,242853.0,2008-08-27T23:44:47Z,,71,"How do threads work in Python, and what are co...",I've been trying to wrap my head around how th...,31340,"Yes, because of the Global Interpreter Lock (G..."
3,34020,3561.0,2008-08-29T05:43:16Z,,17,Are Python threads buggy?,A reliable coder friend told me that Python's ...,34020,Python threads are good for concurrent I/O pro...
4,34570,577.0,2008-08-29T16:10:41Z,2011-11-08T16:11:43Z,13,What is the best quick-read Python book out th...,I am taking a class that requires Python. We w...,34570,"I loved Dive Into Python, especially if you're..."


In [9]:
questions, answers = qa_df['Body'], qa_df['Answer']

In [10]:
questions

0       This is a difficult and open-ended question I ...
1       I'm creating an ZIP file with ZipFile in Pytho...
2       I've been trying to wrap my head around how th...
3       A reliable coder friend told me that Python's ...
4       I am taking a class that requires Python. We w...
                              ...                        
4424    I am trying to determine what percentage of th...
4425    How can we make a class represent itself as a ...
4426    I thought I could make my python (2.7.10) code...
4427    Say, I have given a DataFrame with most of the...
4428    Let's say I have the following code:\n\na = [1...
Name: Body, Length: 4429, dtype: object

In [11]:
answers

0       I started writing up a summary of my experienc...
1       I created a simple library to create a passwor...
2       Yes, because of the Global Interpreter Lock (G...
3       Python threads are good for concurrent I/O pro...
4       I loved Dive Into Python, especially if you're...
                              ...                        
4424    setup\ncreate 2 time series\n\nfrom StringIO i...
4425    TLDR: It's impossible to make custom classes r...
4426    You are not indexing. You are yielding a list;...
4427    You can create a look up data frame from the d...
4428    Use itertools.product within a list comprehens...
Name: Answer, Length: 4429, dtype: object

In [15]:
# convert to a list of JSON-style records, one per question/answer pair:
# [{"prompt": <question body>, "completion": <answer>}, ...]
qa_openai_format = [{"prompt":q, "completion":ans} for q,ans in zip(questions,answers)]
qa_openai_format[4]


{'prompt': 'I am taking a class that requires Python. We will review the language in class next week, and I am a quick study on new languages, but I was wondering if there are any really great Python books I can grab while I am struggling through the basics of setting up my IDE, server environment and all those other "gotchas" that come with a new programming language. Suggestions?\n',
 'completion': "I loved Dive Into Python, especially if you're a quick study.  The beginning basics are all covered (and may move slowly for you), but the latter few chapters are great learning tools.\n\nPlus, Pilgrim is a pretty good writer.\n"}
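
The records above pass the question and answer text through unchanged. OpenAI's guide for the prompt/completion fine-tuning format suggests ending every prompt with a fixed separator and every completion with a fixed stop sequence, so the model can learn where an answer starts and stops. The following is a minimal sketch of that optional step, not part of this notebook's pipeline; `PROMPT_SEPARATOR`, `STOP_SEQUENCE`, and `to_fine_tune_record` are names and values chosen here for illustration.

```python
# Assumed marker strings; any fixed strings that do not occur in the data would do.
PROMPT_SEPARATOR = "\n\n###\n\n"  # tells the model "the prompt ends here"
STOP_SEQUENCE = " END"            # tells the model "the completion ends here"


def to_fine_tune_record(question, answer):
    """Build one prompt/completion record with explicit boundaries."""
    return {
        "prompt": question.strip() + PROMPT_SEPARATOR,
        # The leading space follows the guide's advice that completions
        # start with whitespace.
        "completion": " " + answer.strip() + STOP_SEQUENCE,
    }


record = to_fine_tune_record("What are good Python books?\n", "Dive Into Python.\n")
print(record)
```

If records are prepared this way, the same separator would be appended to prompts at inference time and `STOP_SEQUENCE` passed as the `stop` parameter of the completion request.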

In [16]:
qa_openai_format[4]['prompt']

'I am taking a class that requires Python. We will review the language in class next week, and I am a quick study on new languages, but I was wondering if there are any really great Python books I can grab while I am struggling through the basics of setting up my IDE, server environment and all those other "gotchas" that come with a new programming language. Suggestions?\n'

In [23]:
response = openai.Completion.create(
    model = "text-babbage-001",
    prompt = qa_openai_format[4]['prompt'],
    max_tokens = 250,
    temperature = 0
)

In [24]:
print(response['choices'][0]['text'])


There are a few great Python books that you could consider while you are learning Python. One book that is particularly helpful is "Python for Data Science" by Geoffrey Hinton. This book is packed with information on data science and Python, and it is a great resource for anyone who wants to learn Python for data science purposes. Another great book to consider is "Python for Data Science Mastery" by Michael Nielsen. This book is designed to help you learn more about data science and Python, and it is a great resource for anyone who wants to learn more about Python for data science purposes.


WARNING: The model is hallucinating. The response is wrong: the books it attributes to these authors do not exist.

In [19]:
response = openai.Completion.create(
    model = "text-davinci-003",
    prompt = qa_openai_format[4]['prompt'],
    max_tokens = 250,
    temperature = 0
)

In [21]:
print(response['choices'][0]['text'])


Some great Python books to consider include:

1. Automate the Boring Stuff with Python by Al Sweigart
2. Python Crash Course by Eric Matthes
3. Python for Data Analysis by Wes McKinney
4. Python Cookbook by David Beazley and Brian K. Jones
5. Learning Python by Mark Lutz
6. Fluent Python by Luciano Ramalho
7. Python in a Nutshell by Alex Martelli
8. Python Pocket Reference by Mark Lutz
9. Python for Kids by Jason R. Briggs
10. Python Essential Reference by David Beazley


The 'text-davinci-003' model gives a much better response.

Let's see what happens if we fine-tune a weaker model.
We'll use the Babbage model as the baseline we're trying to improve upon.

## Fine-Tuning and Price Estimation

We can use the **tiktoken** library to estimate costs by counting tokens the same way OpenAI does.

tiktoken supports three encodings for OpenAI models:
- *gpt2* for most GPT-3 models
- *p50k_base* for code models and Davinci models such as **text-davinci-003**
- *cl100k_base* for **text-embedding-ada-002**
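
The list above amounts to a lookup from model name to encoding name. The sketch below spells that lookup out for the models used in this notebook; `ENCODING_BY_MODEL` and `encoding_name_for` are illustrative names, not part of tiktoken, and the table follows the list above rather than any official mapping. tiktoken itself provides `tiktoken.encoding_for_model(model_name)`, which does this kind of lookup and returns the encoding object directly.

```python
# Hypothetical lookup table mirroring the list above (not tiktoken's own mapping).
ENCODING_BY_MODEL = {
    "text-babbage-001": "gpt2",
    "babbage": "gpt2",
    "text-davinci-003": "p50k_base",
    "text-embedding-ada-002": "cl100k_base",
}


def encoding_name_for(model_name):
    """Return the encoding name to pass to tiktoken.get_encoding()."""
    try:
        return ENCODING_BY_MODEL[model_name]
    except KeyError:
        raise KeyError(f"no encoding listed for model {model_name!r}") from None


print(encoding_name_for("text-davinci-003"))
```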

In [25]:
import tiktoken

In [37]:
def num_tokens_from_string(string, encoding_name):
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [38]:
len(qa_df)

4429

In [39]:
# use only the first 500 rows to keep costs down
dataset_size = 500

In [40]:
import json

with open("example_trainin_data.json","w") as f:
    for entry in qa_openai_format[:dataset_size]:
        f.write(json.dumps(entry))
        f.write("\n")
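
Although the file has a .json extension, the loop above writes JSON Lines: one JSON object per line. Before uploading, it can help to read the file back and confirm every line parses and carries both keys. The sketch below does that on a throwaway file with made-up records; `validate_jsonl` is a helper name introduced here, not an OpenAI utility.

```python
import json
import os
import tempfile


def validate_jsonl(path, required_keys=("prompt", "completion")):
    """Return the number of valid records; raise ValueError on the first bad line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)  # JSONDecodeError is a ValueError subclass
            missing = [k for k in required_keys if k not in record]
            if missing:
                raise ValueError(f"line {line_no}: missing keys {missing}")
            if not all(isinstance(record[k], str) for k in required_keys):
                raise ValueError(f"line {line_no}: values must be strings")
            count += 1
    return count


# Demo on a temporary file with two made-up records.
sample = [
    {"prompt": "What is a list?", "completion": " An ordered collection."},
    {"prompt": "What is a dict?", "completion": " A key-value mapping."},
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    for entry in sample:
        tmp.write(json.dumps(entry) + "\n")

n_valid = validate_jsonl(tmp.name)
os.remove(tmp.name)
print(n_valid)
```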

In [41]:
print(qa_openai_format[:1])

[{'prompt': "This is a difficult and open-ended question I know, but I thought I'd throw it to the floor and see if anyone had any interesting suggestions.\n\nI have developed a code-generator that takes our python interface to our C++ code (generated via SWIG) and generates code needed to expose this as WebServices.  When I developed this code I did it using TDD, but I've found my tests to be brittle as hell.  Because each test essentially wanted to verify that for a given bit of input code (which happens to be a C++ header) I'd get a given bit of outputted code I wrote a small engine that reads test definitions from XML input files and generates test cases from these expectations.\n\nThe problem is I dread going in to modify the code at all.  That and the fact that the unit tests themselves are a: complex, and b: brittle.\n\nSo I'm trying to think of alternative approaches to this problem, and it strikes me I'm perhaps tackling it the wrong way.  Maybe I need to focus more on the out

In [42]:
token_counter = 0
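for entry in qa_openai_format:
    # NOTE: .items() yields (key, value) pairs, so the key names "prompt" and
    # "completion" are tokenized along with the text; the total is a slight overestimate.
    for key, text in entry.items():
        token_counter += num_tokens_from_string(key, 'gpt2')
        token_counter += num_tokens_from_string(text, 'gpt2')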

In [43]:
token_counter

2719388
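
Because the loop above walks `.items()`, each record contributes its key names ("prompt" and "completion") to the total as well as its text. A sketch that counts only the text values follows. To keep it runnable without tiktoken, the tokenizer is passed in as a function and the demo uses whitespace splitting as a stand-in, which is not how OpenAI counts tokens; `count_dataset_tokens` and `whitespace_tokens` are names introduced here.

```python
def count_dataset_tokens(records, count_tokens):
    """Sum token counts over the prompt and completion text of each record."""
    total = 0
    for record in records:
        total += count_tokens(record["prompt"])
        total += count_tokens(record["completion"])
    return total


def whitespace_tokens(text):
    """Stand-in tokenizer for the demo only (NOT OpenAI's tokenization)."""
    return len(text.split())


# With tiktoken installed, one would pass instead:
#   lambda s: num_tokens_from_string(s, "gpt2")
demo_records = [
    {"prompt": "How do I reverse a list?", "completion": "Use reversed() or slicing."},
    {"prompt": "What is a tuple?", "completion": "An immutable sequence."},
]
demo_total = count_dataset_tokens(demo_records, whitespace_tokens)
print(demo_total)
```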

In [44]:
# for babbage --> $0.0006 per 1000 tokens (training) * 4 epochs 
print(f"There are {token_counter} tokens")
print(f"Fine tuning using babbage costs $0.0006 per 1000 tokens")
print(f"Estimated price: ${(4*token_counter / 1000) * 0.0006}")

There are 2719388 tokens
Fine tuning using babbage costs $0.0006 per 1000 tokens
Estimated price: $6.526531199999999


In [45]:
token_counter = 0
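for entry in qa_openai_format[:500]:
    # NOTE: .items() yields (key, value) pairs, so the key names "prompt" and
    # "completion" are tokenized along with the text; the total is a slight overestimate.
    for key, text in entry.items():
        token_counter += num_tokens_from_string(key, 'gpt2')
        token_counter += num_tokens_from_string(text, 'gpt2')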
token_counter

197362

In [46]:
# for babbage --> $0.0006 per 1000 tokens (training) * 4 epochs 
print(f"There are {token_counter} tokens")
print(f"Fine tuning using babbage costs $0.0006 per 1000 tokens")
print(f"Estimated price: ${(4*token_counter / 1000) * 0.0006}")

There are 197362 tokens
Fine tuning using babbage costs $0.0006 per 1000 tokens
Estimated price: $0.47366879999999995
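
The two estimates above apply the same arithmetic: tokens divided by 1000, times the per-1000-token price, times the number of epochs. A small helper makes that reusable and rounds the result for display; `estimate_fine_tune_cost` is a name introduced here, and the price and epoch count are the ones assumed in the cells above.

```python
def estimate_fine_tune_cost(n_tokens, price_per_1k_tokens, n_epochs=4):
    """Training cost in dollars, billing every token once per epoch."""
    return n_tokens / 1000 * price_per_1k_tokens * n_epochs


full_cost = estimate_fine_tune_cost(2_719_388, 0.0006)
subset_cost = estimate_fine_tune_cost(197_362, 0.0006)
print(f"Full dataset:   ${full_cost:.2f}")
print(f"First 500 rows: ${subset_cost:.2f}")
```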


To run the fine-tuning job, the OpenAI documentation recommends using the command-line tool.

## Command Line for Fine-Tuning

The full official guide is available here:

https://platform.openai.com/docs/guides/fine-tuning

The command-line tool ships with the openai Python package, which you can install or upgrade by running:

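    pip install --upgrade openai

Then start a fine-tuning job, passing the path of your training file with -t and the base model with -m:

    openai api fine_tunes.create -t training_data.json -m babbage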

You can use:

*openai api fine_tunes.list* to list your fine-tuning jobs, and

*openai api fine_tunes.get -i <YOUR_FINE_TUNE_JOB_ID>* to get the debug log of a specific fine-tuning job.


To cancel the fine-tuning job, run:

    openai api fine_tunes.cancel -i <YOUR_FINE_TUNE_JOB_ID>
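
The same steps can also be driven from Python. The sketch below assumes the pre-1.0 openai package that this notebook uses (the one exposing `openai.Completion`) and its `openai.File` and `openai.FineTune` resources; the helper names are introduced here for illustration. Each call needs a valid API key and network access, so the helpers are only defined, not run.

```python
def start_fine_tune(training_path, base_model="babbage"):
    """Upload a JSONL training file and start a fine-tuning job; return the job id."""
    # Imported lazily so the helpers can be defined without the package installed.
    import openai  # assumes the legacy SDK (openai<1.0)

    with open(training_path, "rb") as f:
        uploaded = openai.File.create(file=f, purpose="fine-tune")
    job = openai.FineTune.create(training_file=uploaded["id"], model=base_model)
    return job["id"]


def fine_tune_status(job_id):
    """Return the current status string of a fine-tuning job."""
    import openai

    return openai.FineTune.retrieve(id=job_id)["status"]


def cancel_fine_tune(job_id):
    """Cancel a running fine-tuning job."""
    import openai

    return openai.FineTune.cancel(id=job_id)
```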

In [47]:
# name of the fine-tuned model, copied from the terminal output
fine_tuned_model = "babbage:ft-sff-2023-03-23-19-48-16"

In [48]:
# Original model
response = openai.Completion.create(
    model = 'text-babbage-001',
    prompt = "What are good Python books?",
    max_tokens = 128
)

In [49]:
print(response['choices'][0]['text'])



Some good Python books that may be of interest include "Python for Data Science & analytics" or "Python for Scientific Computing".


In [50]:
# Fine tuned model
response = openai.Completion.create(
    model = fine_tuned_model,
    prompt = "What are good Python books?",
    max_tokens = 128
)

In [51]:
print(response['choices'][0]['text'])

 The answer is: it depends. For one, you should consider which language you are learning. Thus, the experiences of those who have learnt one language might be worth considering (hint hint...some of the excellent books listed here were written by an English dude). For the other hand, you might want to take a look at the Python general reading list.
To begin with, I suggest Gitplug for a general overview of what's going on with the Python standard library. It's also a good idea to look at other books you might be planning on learning about python.  It's impossible to withall any learning curve with python


In [52]:
# Another fine tuned model
fine_tuned_model2 = "babbage:ft-sff-2023-03-23-19-21-31"

response = openai.Completion.create(
    model = fine_tuned_model2,
    prompt = "What are good Python books?",
    max_tokens = 128
)
print(response['choices'][0]['text'])



In my opinion the top three (in rough order of best to worst) are:




What is Python?

	 by Henrique Giegwn

	 	 	 The book is short so you can read it in one sitting, and it's easy to follow.

	 	 	 It's well written, but what I like best is the speaker restriction. If you are good at explaining things to other people, you'll know this book will be great for that.


Real World Python

	 	 	 Probably the best guide to use as a reference, this is brief enough to
