# Regression

## Classic regression solution using Scikit-Learn

Let's use the California housing dataset, which ships with the Scikit-Learn library. A description of the dataset is available [here](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html#sklearn.datasets.fetch_california_housing).

In [1]:
# Dirty patch to fix SSL error
import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
    getattr(ssl, '_create_unverified_context', None)): 
    ssl._create_default_https_context = ssl._create_unverified_context

In [4]:
# Import the California housing dataset
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

print(housing.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per ce

In [6]:
import pandas as pd

housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)
housing_df

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25
...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32


In [9]:
housing_df['target'] = housing.target
housing_df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


What if Ridge wasn't working?

We could always try another model.

How about an ensemble model? An ensemble model is a model made up of many models.
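
To make the "model made up of many models" idea concrete, here is a minimal sketch showing that a random forest's prediction is the average of its individual trees' predictions. It uses synthetic data from `make_regression` rather than the housing dataset, and the names `X_demo`, `y_demo` and `forest` are assumptions introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration, not the housing dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=0.1, random_state=42)

# A small forest of 10 decision trees
forest = RandomForestRegressor(n_estimators=10, random_state=42)
forest.fit(X_demo, y_demo)

# Each fitted tree is available in forest.estimators_
tree_predictions = np.array([tree.predict(X_demo[:5]) for tree in forest.estimators_])

# Averaging the individual trees reproduces the forest's own prediction
manual_average = tree_predictions.mean(axis=0)
forest_prediction = forest.predict(X_demo[:5])
ensemble_matches = np.allclose(manual_average, forest_prediction)
print(ensemble_matches)
```

Each tree is trained on a different bootstrap sample of the rows, so individual trees disagree; averaging them is what smooths the result.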

In [3]:
# Let's try the Random Forest Regressor
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Setup random seed
np.random.seed(42)

# Create the data
X = housing_df.drop("target", axis=1)
y = housing_df["target"]

# split into training and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate Random Forest Regressor
rf = RandomForestRegressor()
rf.fit(X_train, y_train)

# Check the score of the Random Forest Regressor model on test data
rf.score(X_test, y_test)

NameError: name 'housing_df' is not defined

In [12]:
# let's predict the first 10 rows of the test data
rf.predict(X_test[:10])

array([0.49058  , 0.75989  , 4.9350165, 2.55864  , 2.33461  , 1.6580801,
       2.34237  , 1.66708  , 2.5609601, 4.8519781])

In [13]:
# Compare the predictions to the actual values
np.array(y_test[:10])

array([0.477  , 0.458  , 5.00001, 2.186  , 2.78   , 1.587  , 1.982  ,
       1.575  , 3.4    , 4.466  ])
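
Comparing two arrays by eye only goes so far. As a rough sketch, the ten predictions and ten actual values printed above can be summarised with a mean absolute error. The values are copied from the outputs above; the variable names are assumptions for this example.

```python
import numpy as np

# Values copied from the two cell outputs above
predicted = np.array([0.49058, 0.75989, 4.9350165, 2.55864, 2.33461,
                      1.6580801, 2.34237, 1.66708, 2.5609601, 4.8519781])
actual = np.array([0.477, 0.458, 5.00001, 2.186, 2.78,
                   1.587, 1.982, 1.575, 3.4, 4.466])

# Absolute error per row, in units of $100,000
abs_errors = np.abs(predicted - actual)

# Mean absolute error over these ten rows
mae = abs_errors.mean()

# Position of the row with the largest error
worst_row = int(abs_errors.argmax())

print(f"MAE: {mae:.3f}, worst row: {worst_row}")
```

Ten rows are too few to judge the model; `rf.score(X_test, y_test)` on the full test set returns the R^2 coefficient and is the better overall measure.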

## Let's try to do the same with OpenAI

Let's prepare a dataset for the OpenAI model to train on.

We will use the same dataset as above, split into training and test datasets.

In [10]:
# take the first 1000 rows for training
housing_df_train = housing_df[:1000].copy()

# take the next 20 rows for testing
housing_df_test = housing_df[1000:1020].copy()

In [11]:
# save the training dataset to csv file
housing_df_train.to_csv('data/regression/housing_train.csv', index=False)

# keep the true target values aside for later comparison
targets = housing_df_test['target'].copy()

# fill the target column with zeros so the model has to predict it
housing_df_test["target"] = 0

# save the test dataset to csv file
housing_df_test.to_csv('data/regression/housing_test.csv', index=False)

In [31]:
# check the data
# housing_df_train.head()
# housing_df_test
len(housing_df_test)

20

In [13]:
# Load OpenAI
import dotenv
from openai import OpenAI

dotenv.load_dotenv()

client = OpenAI()

In [14]:
# upload the training file to OpenAI (the context manager closes the file afterwards)
with open("data/regression/housing_train.csv", "rb") as f:
    file = client.files.create(file=f, purpose="assistants")

In [15]:
# create an assistant
assistant = client.beta.assistants.create(
  instructions="You are a linear regression tool. "
               "You have a training dataset file with the California housing dataset. It contains 1000 rows and 9 columns. "
               "The columns are: MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude, target. "
               "The target column is the target variable. "
               "Your task is to analyze the data and provide the best possible linear regression model to predict the target variable for the file provided by the user in the next message. "
               "You have to return a JSON object with the same structure as the input file, but with the target column filled with the predictions instead of zeros. "
               "Please respond only in json_object format. Do not add any additional information above or below.",
  model="gpt-4o-mini",
  tools=[{"type": "code_interpreter"}],
  tool_resources={
    "code_interpreter": {
      "file_ids": [file.id]
    }
  }
)

In [16]:
# upload the test file to OpenAI (the context manager closes the file afterwards)
with open("data/regression/housing_test.csv", "rb") as f:
    file_test = client.files.create(file=f, purpose="assistants")

In [17]:
thread = client.beta.threads.create(
  messages=[
    {
      "role": "user",
      "content": "Read the provided CSV file with California housing data. Its target column is filled with zeros. Predict the target column based on the training dataset.",
      "attachments": [
        {
          "file_id": file_test.id,
          "tools": [{"type": "code_interpreter"}]
        }
      ],
    }
  ]
)

In [24]:
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id, 
    assistant_id=assistant.id,
)

messages = list(client.beta.threads.messages.list(thread_id=thread.id, run_id=run.id))

In [19]:
messages[0].content[0].text.value

'```json\n[{"MedInc":8.3252,"HouseAge":41.0,"AveRooms":6.984127,"AveBedrms":1.02381,"Population":322.0,"AveOccup":2.555556,"Latitude":37.88,"Longitude":-122.23,"target":4.526},"{"MedInc":8.3014,"HouseAge":21.0,"AveRooms":6.238137,"AveBedrms":0.97188,"Population":2401.0,"AveOccup":2.109842,"Latitude":37.86,"Longitude":-122.22,"target":3.585},"{"MedInc":7.2574,"HouseAge":52.0,"AveRooms":8.288136,"AveBedrms":1.073446,"Population":496.0,"AveOccup":2.80226,"Latitude":37.85,"Longitude":-122.24,"target":3.521},"{"MedInc":5.6431,"HouseAge":52.0,"AveRooms":5.817352,"AveBedrms":1.073059,"Population":558.0,"AveOccup":2.547945,"Latitude":37.85,"Longitude":-122.25,"target":3.413},"{"MedInc":3.8462,"HouseAge":52.0,"AveRooms":6.281853,"AveBedrms":1.081081,"Population":565.0,"AveOccup":2.181467,"Latitude":37.85,"Longitude":-122.25,"target":3.422},{"MedInc":4.8437,"HouseAge":29.0,"AveRooms":5.674468,"AveBedrms":0.802631,"Population":1479.0,"AveOccup":2.950926,"Latitude":37.56,"Longitude":-121.93,"targe

In [26]:
import json

response = messages[0].content[0].text.value
response = response.replace("```json", "").replace("```", "").replace("\n", "")

predictions = json.loads(response)

predictions

[{'MedInc': 8.3252,
  'HouseAge': 41.0,
  'AveRooms': 6.984127,
  'AveBedrms': 1.02381,
  'Population': 322.0,
  'AveOccup': 2.555556,
  'Latitude': 37.88,
  'Longitude': -122.23,
  'target': 4.526},
 {'MedInc': 8.3014,
  'HouseAge': 21.0,
  'AveRooms': 6.238137,
  'AveBedrms': 0.97188,
  'Population': 2401.0,
  'AveOccup': 2.109842,
  'Latitude': 37.86,
  'Longitude': -122.22,
  'target': 3.585},
 {'MedInc': 7.2574,
  'HouseAge': 52.0,
  'AveRooms': 8.288136,
  'AveBedrms': 1.073446,
  'Population': 496.0,
  'AveOccup': 2.80226,
  'Latitude': 37.85,
  'Longitude': -122.24,
  'target': 3.521},
 {'MedInc': 5.6431,
  'HouseAge': 52.0,
  'AveRooms': 5.817352,
  'AveBedrms': 1.073059,
  'Population': 558.0,
  'AveOccup': 2.547945,
  'Latitude': 37.85,
  'Longitude': -122.25,
  'target': 3.413},
 {'MedInc': 3.8462,
  'HouseAge': 52.0,
  'AveRooms': 6.281853,
  'AveBedrms': 1.081081,
  'Population': 565.0,
  'AveOccup': 2.181467,
  'Latitude': 37.85,
  'Longitude': -122.25,
  'target': 3.422
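
Chained `replace` calls work for this reply, but they assume the reply is always wrapped in exactly the same fence. A slightly more defensive sketch is shown below. The helper name `extract_json` and the `sample_reply` string are hypothetical and only stand in for the assistant's real reply.

```python
import json
import re

# Three backticks, built programmatically to keep this example easy to embed
FENCE = "`" * 3

def extract_json(text):
    """Parse JSON from a reply that may or may not be wrapped in a Markdown fence.

    Returns the parsed object, or None if the payload is not valid JSON.
    """
    pattern = FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE
    match = re.search(pattern, text, re.DOTALL)
    payload = match.group(1) if match else text.strip()
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return None

# Hypothetical reply, shaped like the assistant output above
sample_reply = FENCE + 'json\n[{"MedInc": 8.3252, "target": 4.526}]\n' + FENCE
parsed = extract_json(sample_reply)
print(parsed)
```

Returning `None` instead of raising lets the notebook decide what to do with a malformed reply, for example asking the assistant again.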

In [30]:
# output the predictions target column
predicted_targets = [row['target'] for row in predictions]
len(predicted_targets)

38

In [28]:
targets

1000    1.844
1001    1.584
1002    1.746
1003    1.684
1004    1.884
1005    2.567
1006    1.838
1007    1.834
1008    1.775
1009    1.869
1010    2.389
1011    1.725
1012    1.792
1013    2.044
1014    2.094
1015    2.642
1016    3.500
1017    1.998
1018    2.110
1019    2.011
Name: target, dtype: float64

In [33]:
pred_results = pd.DataFrame({'actual': targets, 'predicted': predicted_targets})
pred_results.head(20)

ValueError: array length 38 does not match index length 20
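
The error above appears because the assistant returned 38 rows for a 20-row test file. One way to fail more gracefully is to check the counts before building the comparison table. The sketch below uses a hypothetical helper `build_comparison` and made-up demo values; it is not part of the original notebook run.

```python
import pandas as pd

def build_comparison(actual, predicted):
    """Pair actual and predicted values, or return None when the counts differ."""
    if len(predicted) != len(actual):
        print(f"Expected {len(actual)} predictions, got {len(predicted)}")
        return None
    return pd.DataFrame(
        {"actual": list(actual), "predicted": list(predicted)},
        index=actual.index,
    )

# Demo values (assumptions for illustration only)
demo_targets = pd.Series([1.844, 1.584, 1.746], index=[1000, 1001, 1002])

# Too few predictions: the helper reports the mismatch and returns None
mismatch = build_comparison(demo_targets, [4.526, 3.585])

# Matching counts: the helper returns a table indexed like the targets
aligned = build_comparison(demo_targets, [1.9, 1.6, 1.7])
print(aligned)
```

A length check does not prove that the rows are in the right order, so comparing the returned feature columns against the test file would be a sensible extra step.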

## Summary

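Using OpenAI for regression tasks is not a perfect solution. In this run the assistant returned 38 rows instead of the 20 requested, and the first rows it returned match the opening rows of the dataset with their original target values rather than new predictions for the test rows, so the results could not be compared with the actual values.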