<h2>Using OpenAI for code assistance</h2>
<li>OpenAI provides an API for its GPT models</li>
<li>With appropriate prompts, OpenAI models (ChatGPT) can help with data analytics</li>

In [1]:
# openai.ChatCompletion (used below) was removed in openai 1.0, so pin an earlier release
%pip install "openai<1"
%pip install catboost

Note: you may need to restart the kernel to use updated packages.


<h3>API Key</h3>
<li>Get an API key from https://platform.openai.com/account/api-keys</li>
<li>(You can log in with a Google account)</li>

In [2]:
with open("../credentials/openai") as f:
    OPEN_API_KEY = f.read().strip()

In [3]:
import openai
openai.api_key = OPEN_API_KEY
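
A plain-text key file is one option; an environment variable is a common alternative. The sketch below is a minimal, hypothetical helper (the name `load_api_key` and its parameters are assumptions for illustration, not part of the original notebook). It prefers an environment variable and falls back to the key file used above.

```python
import os
from pathlib import Path


def load_api_key(path="../credentials/openai", env_var="OPENAI_API_KEY"):
    """Return the API key from the environment, else from a key file."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    key_file = Path(path)
    if not key_file.is_file():
        raise FileNotFoundError(
            f"No key in ${env_var} and no key file at {key_file}"
        )
    key = key_file.read_text().strip()
    if not key:
        raise ValueError(f"Key file {key_file} is empty")
    return key
```

With this helper, the assignment above could be written as `openai.api_key = load_api_key()`.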


<h2>Conversations</h2>
<li>ChatGPT handles queries in a message/response format</li>
<li>Each message contains a role designation and the message content (the prompt)</li>
<li>Possible roles are system, assistant, and user</li>
<li>Currently, roles have only a limited effect on responses, but OpenAI plans to expand their use</li>
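
The message format can be sketched without calling the API. The helper below is hypothetical (`make_message` and `VALID_ROLES` are names introduced here for illustration); it builds the list of role/content dictionaries and rejects any role other than the three listed above.

```python
VALID_ROLES = {"system", "assistant", "user"}


def make_message(role, content):
    """Build one chat message, rejecting roles outside the three listed above."""
    if role not in VALID_ROLES:
        raise ValueError(f"Unknown role: {role!r}")
    return {"role": role, "content": content}


messages = [
    make_message("system", "You are a helpful assistant"),
    make_message("user", "What was the US GDP in 2020"),
]
print(messages)
```

A list built this way has the shape expected by the `messages` argument in the cells that follow.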


<h2>Example</h2>
<li>Ask ChatGPT a question about GDP</li>

In [4]:
MODEL = "gpt-4"

In [5]:
response = openai.ChatCompletion.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "What was the US GDP in 2020"},
    ]
)

<h4>The response</h4>
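<li>ChatGPT returns a response object, which is displayed as JSON</li>
<li>The reply text is in choices[0].message.content</li>
<li>Note that the role of the reply is "assistant"</li>
<li>The usage field reports the number of tokens in the prompt and in the completion</li>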

In [6]:
response

<OpenAIObject chat.completion id=chatcmpl-7xgfe2GSNG8WIVqF7oAszS4tJhg0Y at 0x7fbd696fa4f0> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "The US GDP in 2020 was approximately $21.43 trillion. Please note that this figure may vary slightly depending on the source of the data, and that the COVID-19 pandemic had a significant impact on global economies in this year.",
        "role": "assistant"
      }
    }
  ],
  "created": 1694459734,
  "id": "chatcmpl-7xgfe2GSNG8WIVqF7oAszS4tJhg0Y",
  "model": "gpt-4-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 49,
    "prompt_tokens": 26,
    "total_tokens": 75
  }
}

In [7]:
response_content = response.choices[0].message.content
print(response_content)

The US GDP in 2020 was approximately $21.43 trillion. Please note that this figure may vary slightly depending on the source of the data, and that the COVID-19 pandemic had a significant impact on global economies in this year.
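
The same fields can also be read from the raw JSON. The sketch below uses only the standard library and an abridged copy of the response shown above (the content string is shortened); `summarize_response` is a hypothetical helper, not part of the openai package.

```python
import json

# Abridged copy of the response printed above (content shortened)
raw = """
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "The US GDP in 2020 was approximately $21.43 trillion.",
        "role": "assistant"
      }
    }
  ],
  "model": "gpt-4-0613",
  "usage": {"completion_tokens": 49, "prompt_tokens": 26, "total_tokens": 75}
}
"""


def summarize_response(data):
    """Pull the reply text, its role, and token usage out of a response dict."""
    choice = data["choices"][0]
    return {
        "role": choice["message"]["role"],
        "content": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "total_tokens": data["usage"]["total_tokens"],
    }


summary = summarize_response(json.loads(raw))
print(summary["role"], summary["total_tokens"])
```

Checking `finish_reason` is a cheap way to notice replies that were cut off rather than completed.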


<h3>Ask a follow-up question</h3>
<li>The API does not remember earlier messages, so the follow-up request must include the original question and the model's response</li>


In [None]:
response = openai.ChatCompletion.create(
    model=MODEL,
    messages=[
        {"role": "system","content":"You are a helpful assistant"},
        {"role": "user", "content": "What was the US GDP in 2020"},
        {"role": "assistant","content": response_content},
        {"role": "user","content": "By what percent did it grow over 2019?"}
    ]
)

In [None]:
response_content_2 = response.choices[0].message.content
print(response_content_2)
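
Because every request must carry the full history, it can help to keep the messages in one list and extend it each turn. This is a minimal sketch, assuming a hypothetical helper named `add_followup`; the assistant text is a placeholder, not real model output.

```python
def add_followup(messages, assistant_reply, followup):
    """Return a new message list extended with the last reply and a new question."""
    return messages + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": followup},
    ]


history = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What was the US GDP in 2020"},
]
# Placeholder text stands in for the model's actual reply
history = add_followup(
    history,
    "<assistant reply goes here>",
    "By what percent did it grow over 2019?",
)
print([m["role"] for m in history])
```

The helper returns a new list instead of mutating its argument, so earlier states of the conversation stay available for retries.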

<h2>Coding help</h2>
<li>GPT models can answer documentation questions, write code, and generate sample data</li>


<h4>Documentation questions</h4>

In [8]:
prompt = """
I want to read a pdf file into a python string. How can I do that?
"""

response = openai.ChatCompletion.create(
    model=MODEL,
    messages=[
        {"role": "system","content":"You are a helpful assistant"},
        {"role":"user","content": prompt}
    ])

In [9]:
print(response.choices[0].message.content)

To read a PDF file into a Python string, you can use a library called PyPDF2. Here is a small example:

You would first need to install the library using pip:

```bash
pip install PyPDF2
```

And here is the Python code to read the PDF:

```python
import PyPDF2

def read_pdf(file):
    pdf_file_obj = open(file, 'rb')
    pdf_reader = PyPDF2.PdfFileReader(pdf_file_obj)
    page_obj = pdf_reader.getPage(0)
    text = page_obj.extractText()
    pdf_file_obj.close()
    return text

# Use the function
print(read_pdf('your_file.pdf'))
```

This snippet defines a function that opens a PDF file, reads the text from the first page (you can loop over all pages if you like), and then closes the file.

The above function reads only the text content from the PDF files and discards everything else like images, tables, etc. If you want to read a PDF file as is including images and tables you might want to convert it into an image using libraries like pdf2image and then use OCR(optical character reco

<h3>Get some code</h3>
<li>Make the prompt as detailed as you like; describing the columns and their value ranges gives the model more to work with</li>

In [10]:
prompt = """
Answer the question using python code

Context:
I have data in a pandas dataframe. One column in the dataframe is 
named income and contains annual family income which ranges from $30,000 to $720,000.
A second column is named household_size and contains the number of members
in the family (ranging from 0 to 6)

Q: Using python's sklearn package, use feature engineering on the two columns 
to come up with a new income feature

A:
"""
response = openai.ChatCompletion.create(
    model=MODEL,
    messages=[
        {"role": "system","content":"You are an expert python coder"},
        {"role": "user", "content": prompt}
    ])


In [11]:
print(response.choices[0].message.content)

There are different feature engineering methods to derive a new feature. The most basic form could be normalizing the income column or creating an income per member feature. 

Here's an example how you can do it using sklearn's MinMaxScaler for normalization, and then by a simple division to achieve income per member:

```python
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
import numpy as np

# let's assume this is your dataframe
df = pd.DataFrame({
    'income': np.random.randint(30000, 720000, 100),
    'household_size': np.random.randint(1, 7, 100)
})

# use MinMaxScaler to normalize your income data
scaler = MinMaxScaler()
df['income_scaled'] = scaler.fit_transform(df[['income']])

# create a new feature: income per member
df['income_per_member'] = df['income'] / df['household_size']
```

Remember that, in a real case scenario, you will have to verify the best feature that helps to improve your machine learning model. These new features could have limited impa
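
Generated code still needs checking against the data described in the prompt. The prompt allows a household_size of 0, and dividing by zero in pandas produces `inf`, not an error. The sketch below shows one way to guard the per-member feature; the three-row sample and the choice to treat 0 as missing are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Small hypothetical sample; household_size includes 0, as the prompt allows
df = pd.DataFrame({
    "income": [30000, 120000, 720000],
    "household_size": [0, 4, 6],
})

# Treat a household size of 0 as missing so the division yields NaN, not inf
size = df["household_size"].replace(0, np.nan)
df["income_per_member"] = df["income"] / size

print(df)
```

Rows with a missing result can then be dropped or imputed, depending on what the downstream model expects.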

<h2>Generate sample data</h2>

In [12]:
prompt = """Write code to generate a pandas dataframe with two columns 
named income and household_size. Values in the income column should
be drawn from a normal distribution with mean 300,000 and standard deviation
150000. Values in the household_size column should be drawn from
a uniform distribution with values ranging from 1 to 6
"""
response = openai.ChatCompletion.create(
    model=MODEL,
    messages=[
        {"role": "system","content":"You are an expert python coder"},
        {"role": "user", "content": prompt}
    ]
)

In [13]:
print(response.choices[0].message.content)

You will first need to import the necessary libraries: pandas and numpy.

Here's a simple Python code snippet that would generate the desired DataFrame:

```python
import pandas as pd
import numpy as np

# set a seed for reproducibility
np.random.seed(0)

# number of samples
num_samples = 1000

# generate the data
income = np.random.normal(loc=300000, scale=150000, size=num_samples)
household_size = np.random.randint(low=1, high=7, size=num_samples)

# create DataFrame
df = pd.DataFrame({
    'income': income,
    'household_size': household_size
})

print(df.head())
```

In this code, `num_samples` signifies the number of samples/data points you want to generate. I've used 1000 as an example, but you can adjust this to your needs. The `head()` method is used to print just the first 5 entries, for brevity.
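
Copying the generated code into a cell by hand works for a single reply. For repeated use, the fenced blocks can be pulled out of the reply text with a regular expression. This is a minimal sketch, assuming replies mark code with triple-backtick fences and a language tag as in the output above; `extract_code_blocks` is a hypothetical helper.

```python
import re

FENCE = "`" * 3  # three backticks, built here to keep this example readable


def extract_code_blocks(reply, language="python"):
    """Return the contents of all fenced blocks tagged with `language`."""
    pattern = re.compile(
        re.escape(FENCE + language) + r"[ \t]*\n(.*?)" + re.escape(FENCE),
        re.DOTALL,
    )
    return [block.strip("\n") for block in pattern.findall(reply)]


# Stand-in reply text; a real reply would come from the API response
reply = (
    "Here's a snippet:\n\n"
    + FENCE + "python\nimport pandas as pd\nprint(pd.__version__)\n" + FENCE
    + "\n\nAdjust as needed."
)
blocks = extract_code_blocks(reply)
print(blocks[0])
```

Extracted code should be read before it is run; the model's output is text, and nothing guarantees that it is correct or safe to execute.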


In [14]:
import pandas as pd
import numpy as np

# set a seed for reproducibility
np.random.seed(0)

# number of samples
num_samples = 1000

# generate the data
income = np.random.normal(loc=300000, scale=150000, size=num_samples)
household_size = np.random.randint(low=1, high=7, size=num_samples)

# create DataFrame
df = pd.DataFrame({
    'income': income,
    'household_size': household_size
})

print(df.head())

          income  household_size
0  564607.851895               1
1  360023.581255               4
2  446810.697616               4
3  636133.979880               2
4  580133.698522               6
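
The generated data matches the prompt, but the prompt itself has a side effect: a normal distribution with mean 300,000 and standard deviation 150,000 puts roughly 2% of draws below zero, which makes no sense for income. The sketch below measures that share and shows one simple fix. Clipping at a floor of 30,000 is an assumption borrowed from the income range in the earlier prompt, not something the notebook does.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator for reproducibility
income = rng.normal(loc=300_000, scale=150_000, size=1000)

# Share of draws below zero; about 2.3% is expected in theory
negative_share = (income < 0).mean()
print(f"negative share: {negative_share:.3f}")

# One simple option (an assumption, not the notebook's method): clip at a floor
income_clipped = np.clip(income, 30_000, None)
print(income_clipped.min())
```

Clipping piles probability mass at the floor, so a distribution that is positive by construction (for example, a log-normal) may be a better request to put in the prompt.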


<h4>Web scraping</h4>

In [19]:
prompt = """
url: http://www.xe.com/currencytables/?from=USD

Q: Using the URL above, write python code using the requests library and
the BeautifulSoup library that returns a list containing all the cross
currency rates for USD
"""

response = openai.ChatCompletion.create(
    model=MODEL,
    messages=[
        {"role": "system","content":"You are an expert python coder"},
        {"role": "user", "content":  prompt},
    ],
    temperature=0,
)

In [20]:
print(response.choices[0].message.content)

Sure, here is a simple Python script that uses the `requests` and `BeautifulSoup` libraries to scrape the cross currency rates for USD from the provided URL.

```python
import requests
from bs4 import BeautifulSoup

def get_currency_rates(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    table = soup.find('table')
    rows = table.find_all('tr')[1:]  # Skip the header row

    rates = []
    for row in rows:
        cols = row.find_all('td')
        currency_code = cols[0].text.strip()
        rate = cols[2].text.strip()
        rates.append((currency_code, rate))

    return rates

url = 'http://www.xe.com/currencytables/?from=USD'
rates = get_currency_rates(url)
for rate in rates:
    print(rate)
```

This script first sends a GET request to the provided URL and parses the response text with BeautifulSoup. It then finds the table in the parsed HTML, skips the header row, and iterates over the remaining rows. For each row, it finds all 

In [18]:
import requests
from bs4 import BeautifulSoup

def get_usd_rates(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    table = soup.find('table')
    rows = table.find_all('tr')

    rates = []

    for row in rows[1:]:  # Skip the header row
        cols = row.find_all('td')
        currency_code = cols[0].text.strip()
        rate = cols[1].text.strip()
        rates.append((currency_code, rate))

    return rates

url = 'http://www.xe.com/currencytables/'
usd_rates = get_usd_rates(url)

for rate in usd_rates:
    print(rate)

('EUR / USD', '1.07470')
('GBP / EUR', '1.16407')
('USD / JPY', '146.579')
('GBP / USD', '1.25102')
('USD / CHF', '0.891175')
('USD / CAD', '1.35831')
('EUR / JPY', '157.528')
('AUD / USD', '0.642884')


In [22]:
import requests
from bs4 import BeautifulSoup

def get_currency_rates(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    table = soup.find('table')
    rows = table.find_all('tr')[1:]  # Skip the header row

    rates = []
    for row in rows:
        cols = row.find_all('td')
        currency_code = cols[0].text.strip()
        rate = cols[1].text.strip()
        rates.append((currency_code, rate))

    return rates

url = 'http://www.xe.com/currencytables/?from=USD'
rates = get_currency_rates(url)
for rate in rates:
    print(rate)

('US Dollar', '1')
('Euro', '0.9347301400875501')
('British Pound', '0.8021114256909363')
('Indian Rupee', '83.01214277585939')
('Australian Dollar', '1.566983400711194')
('Canadian Dollar', '1.3642644518256')
('Singapore Dollar', '1.3663133437280797')
('Swiss Franc', '0.8928708107095304')
('Malaysian Ringgit', '4.675453018525618')
('Japanese Yen', '147.86031790484608')
('Chinese Yuan Renminbi', '7.319677031559003')
('New Zealand Dollar', '1.6997649486387005')
('Thai Baht', '35.644267439787896')
('Hungarian Forint', '359.79744500461084')
('Emirati Dirham', '3.6725')
('Hong Kong Dollar', '7.839832973188957')
('Mexican Peso', '17.585928354221426')
('South African Rand', '19.118838601672614')
('Philippine Peso', '56.66376235562046')
('Swedish Krona', '11.125699326967517')
('Indonesian Rupiah', '15325.890963831705')
('Brazilian Real', '4.989044309045255')
('Saudi Arabian Riyal', '3.75')
('Turkish Lira', '26.827180109961326')
('Kenyan Shilling', '146.28204273239004')
('South Korean Won', '1

In [26]:
import requests
from bs4 import BeautifulSoup

def get_currency_rates(currency_symbol):
    url = f'http://www.xe.com/currencytables/?from={currency_symbol}'
    print(url)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    table = soup.find('table')
    rows = table.find_all('tr')[1:]  # Skip the header row

    rates = []
    for row in rows:
        cols = row.find_all('td')
        currency_code = cols[0].text.strip()
        rate = cols[1].text.strip()
        rates.append((currency_code, rate))

    return rates

rates = get_currency_rates("GBP")
for rate in rates:
    print(rate)

http://www.xe.com/currencytables/?from=GBP
('US Dollar', '1.2467095817000775')
('Euro', '1.1653370219510044')
('British Pound', '1')
('Indian Rupee', '103.49203379611878')
('Australian Dollar', '1.9535732200316176')
('Canadian Dollar', '1.7008415640637793')
('Singapore Dollar', '1.7033959372304683')
('Swiss Franc', '1.1131505949318878')
('Malaysian Ringgit', '5.8289320769844375')
('Japanese Yen', '184.33887508519115')
('Chinese Yuan Renminbi', '9.12551149019459')
('New Zealand Dollar', '2.119113248105808')
('Thai Baht', '44.43804974986366')
('Hungarian Forint', '448.562922158455')
('Emirati Dirham', '4.578540938793535')
('Hong Kong Dollar', '9.77399488660288')
('Mexican Peso', '21.924545382298927')
('South African Rand', '23.83563927568256')
('Philippine Peso', '70.64325546392818')
('Swedish Krona', '13.870515954044507')
('Indonesian Rupiah', '19106.935112699623')
('Brazilian Real', '6.219889343612962')
('Saudi Arabian Riyal', '4.675160931375291')
('Turkish Lira', '33.44570249308252')
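
The scraped rates are strings, so they need converting before any arithmetic. The sketch below is a possible post-processing step using only the standard library: it builds the query string with `urlencode` and converts the rate strings to floats. Both helper names (`build_url`, `rates_to_floats`) are hypothetical, and the sample rows are copied from the GBP output above.

```python
from urllib.parse import urlencode


def build_url(currency_symbol):
    """Build the currency-table URL with a properly encoded query string."""
    base = "http://www.xe.com/currencytables/"
    return base + "?" + urlencode({"from": currency_symbol})


def rates_to_floats(rates):
    """Convert (name, rate_string) pairs to a {name: float} dict."""
    parsed = {}
    for name, rate in rates:
        try:
            parsed[name] = float(rate.replace(",", ""))
        except ValueError:
            # Skip rows whose rate cell is not numeric
            continue
    return parsed


# A few rows copied from the GBP output above
sample = [
    ("US Dollar", "1.2467095817000775"),
    ("Euro", "1.1653370219510044"),
    ("British Pound", "1"),
]
print(build_url("GBP"))
print(rates_to_floats(sample))
```

Skipping non-numeric rows silently is a design choice suited to exploration; a production scraper would more likely log or raise on unexpected cells.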
