# Dataset Exploration for Real Estate Price Prediction


This notebook explores the USA Real Estate dataset, checks for inconsistencies, and outlines a plan to prepare it as natural language data for LLM fine-tuning.

1. Takes raw USA real estate data (CSV format) as input
2. Cleans and preprocesses the data (handling missing values, outliers, data types)
3. Converts tabular data into natural language text format suitable for LLM fine-tuning
4. Outputs structured JSON datasets (train/val splits) containing property descriptions and price information
5. Saves the text data for upload to Kaggle and Hugging Face, so it can be reloaded from Colab and Kaggle notebooks.

**Input raw data**
- [Real Estate Tabular Data](https://www.kaggle.com/datasets/ahmedshahriarsakib/usa-real-estate-dataset)

**Output text data**
- [Kaggle Real Estate Text Data](https://www.kaggle.com/datasets/hebamo7amed/llm-real-estate-text-data/data)
- [Hugging Face Real Estate Text Data](https://huggingface.co/datasets/heba1998/real-estate-data-for-llm-fine-tuning)

---
## [0] Setup
---

#### Install packages

In [1]:
# !conda create -n llm_env python=3.12.9 -y
# !conda activate llm_env
# !pip install -r requirements.txt

In [None]:
import os
import json
import gc
import pandas as pd
import numpy as np
import kaggle 

from src.utils import check_missing, check_outliers_zscore, reduce_mem_usage, timeit

import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)

seed = 2025

#### Download raw dataset from Kaggle

In [9]:
if not os.path.exists('./data/usa-real-estate-dataset.zip'):
    print('File does not exist, downloading...')
    !kaggle datasets download -d ahmedshahriarsakib/usa-real-estate-dataset -p ./data
    !unzip -o ./data/usa-real-estate-dataset.zip -d ./data
else:
    print('File exists, skipping download...')

File does not exist, downloading...
Downloading usa-real-estate-dataset.zip to ./data




 

#### Data paths

In [12]:
import os
for dirname, listdir, filenames in os.walk('./data'):  
    if listdir: print(listdir)
    # for filename in filenames:  
    #    print(os.path.join(dirname, filename))

['tabular_data', 'text_data', 'usa-real-estate-dataset']


In [3]:
tabular_data_dir = 'data/tabular_data'
text_data_dir = 'data/text_data'
raw_data_path = 'data/usa-real-estate-dataset/realtor-data.zip.csv'

os.makedirs(tabular_data_dir, exist_ok=True)
os.makedirs(text_data_dir, exist_ok=True)

---
## [1] Load and Inspect the raw dataset
---

In [6]:
df = pd.read_csv(raw_data_path, low_memory=False)
print(f"Dataset shape: {df.shape}")

Dataset shape: (2226382, 12)


In [7]:
df.head()

Unnamed: 0,brokered_by,status,price,bed,bath,acre_lot,street,city,state,zip_code,house_size,prev_sold_date
0,103378.0,for_sale,105000.0,3.0,2.0,0.12,1962661.0,Adjuntas,Puerto Rico,601.0,920.0,
1,52707.0,for_sale,80000.0,4.0,2.0,0.08,1902874.0,Adjuntas,Puerto Rico,601.0,1527.0,
2,103379.0,for_sale,67000.0,2.0,1.0,0.15,1404990.0,Juana Diaz,Puerto Rico,795.0,748.0,
3,31239.0,for_sale,145000.0,4.0,2.0,0.1,1947675.0,Ponce,Puerto Rico,731.0,1800.0,
4,34632.0,for_sale,65000.0,6.0,2.0,0.05,331151.0,Mayaguez,Puerto Rico,680.0,,


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2226382 entries, 0 to 2226381
Data columns (total 12 columns):
 #   Column          Dtype  
---  ------          -----  
 0   brokered_by     float64
 1   status          object 
 2   price           float64
 3   bed             float64
 4   bath            float64
 5   acre_lot        float64
 6   street          float64
 7   city            object 
 8   state           object 
 9   zip_code        float64
 10  house_size      float64
 11  prev_sold_date  object 
dtypes: float64(8), object(4)
memory usage: 203.8+ MB


In [9]:
df.describe(include='number')

Unnamed: 0,brokered_by,price,bed,bath,acre_lot,street,zip_code,house_size
count,2221849.0,2224841.0,1745065.0,1714611.0,1900793.0,2215516.0,2226083.0,1657898.0
mean,52939.89,524195.5,3.275841,2.49644,15.22303,1012325.0,52186.68,2714.471
std,30642.75,2138893.0,1.567274,1.652573,762.8238,583763.5,28954.08,808163.5
min,0.0,0.0,1.0,1.0,0.0,0.0,0.0,4.0
25%,23861.0,165000.0,3.0,2.0,0.15,506312.8,29617.0,1300.0
50%,52884.0,325000.0,3.0,2.0,0.26,1012766.0,48382.0,1760.0
75%,79183.0,550000.0,4.0,3.0,0.98,1521173.0,78070.0,2413.0
max,110142.0,2147484000.0,473.0,830.0,100000.0,2001357.0,99999.0,1040400000.0


In [10]:
df.describe(include='object')

Unnamed: 0,status,city,state,prev_sold_date
count,2226382,2224975,2226374,1492085
unique,3,20098,55,14954
top,for_sale,Houston,Florida,2022-03-31
freq,1389306,23862,249432,17171


---
## [2] Clean and Prepare the dataset
---



Check for any inconsistencies following these steps:

1. Convert columns to correct data types that reduce the data size
2. Check for duplicate records
3. Check for outliers using utility function (using Z-score)
4. Check negative values
5. Handle missing values
6. Use utility function to reduce the size.
7. Separate the records with an unknown label into their own sheet for testing

### 1. Handle features types

Convert columns to correct data types that reduce the data size.
- Some columns are floats but they should be strings (e.g. `brokered_by`, `street`, `zip_code`)
- Some columns are strings but they should be datetime (e.g `prev_sold_date`)


In [None]:
# Cast ID-like float columns to strings. Strip only a trailing ".0" (anchored regex),
# then turn the literal string 'nan' back into a real missing value.
for col in ['brokered_by', 'zip_code', 'street']:
    df[col] = (df[col].astype('str')
                      .str.replace(r'\.0$', '', regex=True)
                      .replace('nan', np.nan))

df['prev_sold_date'] = pd.to_datetime(df['prev_sold_date'],format='mixed')
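
The same conversion can also be sketched without string replacement, by going through pandas' nullable integer dtype. This is a minimal sketch rather than the notebook's method: `float_ids_to_str` and its `width` parameter are hypothetical names, and padding to five digits assumes US ZIP codes (the raw sample shows `601.0` for Adjuntas, Puerto Rico, which would then read `00601`).

```python
from typing import Optional

import numpy as np
import pandas as pd


def float_ids_to_str(s: pd.Series, width: Optional[int] = None) -> pd.Series:
    """Convert float-encoded IDs such as 601.0 to strings, keeping NaN as <NA>."""
    # "Int64" (capital I) is pandas' nullable integer dtype, so NaN survives the cast
    out = s.astype("Int64").astype("string")
    if width is not None:
        out = out.str.zfill(width)  # left-pad with zeros, e.g. "601" -> "00601"
    return out


zips = float_ids_to_str(pd.Series([601.0, np.nan, 99354.0]), width=5)
print(zips.tolist())
```

Whether zero-padding is wanted depends on how the ZIP code is later written into the prompt text, so it is left optional here.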

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1492083 entries, 0 to 1492082
Data columns (total 12 columns):
 #   Column          Non-Null Count    Dtype         
---  ------          --------------    -----         
 0   brokered_by     1488471 non-null  object        
 1   status          1492083 non-null  object        
 2   price           1491583 non-null  float64       
 3   bed             1336443 non-null  float64       
 4   bath            1329987 non-null  float64       
 5   acre_lot        1297484 non-null  float64       
 6   street          1486855 non-null  object        
 7   city            1491745 non-null  object        
 8   state           1492083 non-null  object        
 9   zip_code        1492055 non-null  object        
 10  house_size      1269124 non-null  float64       
 11  prev_sold_date  1492083 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(5), object(6)
memory usage: 136.6+ MB


### 2. Check Duplicate Records

In [252]:
df.duplicated().sum()

11811

### 3. Outlier Detection

In [253]:
def check_outliers_zscore(df, threshold=3):

    numeric_cols = df.select_dtypes(include=np.number).columns
    outliers = {}
    
    for col in numeric_cols:
        z_scores = np.abs((df[col] - df[col].mean()) / df[col].std())
        outliers[col] = len(z_scores[z_scores > threshold])
    
    return pd.DataFrame({
        'feature': outliers.keys(),
        'num_outliers': outliers.values(),
        'percent_outliers (%)': (np.array(list(outliers.values()))) / len(df) * 100
    })

In [254]:
check_outliers_zscore(df)

Unnamed: 0,feature,num_outliers,percent_outliers (%)
0,price,10381,0.763114
1,bed,9783,0.719155
2,bath,13997,1.028929
3,acre_lot,629,0.046238
4,house_size,1136,0.083508


> **Observations**

- `price` has 10,381 outliers, which is 0.76% of the data.
- `bed` has 9,783 outliers (0.72%) and `bath` has 13,997 (1.03%).
- `acre_lot` has 629 outliers, which is 0.05% of the data.
- `house_size` has 1,136 outliers, which is 0.08% of the data.
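
The Z-score depends on the column mean and standard deviation, and both are pulled upward by extreme values such as the maximum `price` of about 2.1 billion reported by `df.describe()`. As a point of comparison, here is a minimal sketch of the interquartile-range (IQR) rule, which is a different technique from the one used above. The function name `check_outliers_iqr` and the multiplier `k=1.5` are assumptions for illustration, not part of `src.utils`.

```python
import numpy as np
import pandas as pd


def check_outliers_iqr(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include=np.number)
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # The per-column bounds align with the DataFrame columns;
    # NaN cells compare as False, so missing values are never counted
    mask = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    counts = mask.sum()
    return pd.DataFrame({
        "feature": counts.index,
        "num_outliers": counts.to_numpy(),
        "percent_outliers (%)": counts.to_numpy() / len(df) * 100,
    })


toy = pd.DataFrame({
    "price": [100.0, 110.0, 120.0, 130.0, 1e9],
    "bed": [1, 2, 3, 4, 5],
})
print(check_outliers_iqr(toy))
```

On skewed columns like `price` the two rules can flag very different numbers of rows, so the counts are best read as a diagnostic rather than a removal list.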


### 4. Check for Negative Values

Some features, such as `price`, `bed`, `bath`, `acre_lot`, and `house_size`, are not supposed to have negative values. I will check for any negative values in these columns and handle them appropriately.

In [255]:
features = df.select_dtypes(include='number').columns
features

Index(['price', 'bed', 'bath', 'acre_lot', 'house_size'], dtype='object')

In [256]:
for col in features:
    if (df[col]<0).any():
        print(f"Negative values found in column {col}")
        print(df.loc[df[col]<0, col].value_counts(normalize=True).head())


Negative values found in column bed
bed
-68     0.4
-66     0.2
-44     0.2
-108    0.2
Name: proportion, dtype: float64
Negative values found in column bath
bath
-34    0.333333
-58    0.166667
-93    0.166667
-44    0.166667
-81    0.166667
Name: proportion, dtype: float64


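> The raw data has no negative values: `df.describe()` reports a minimum of 1.0 for both `bed` and `bath`. The negatives printed above most likely come from an earlier run in which `bed` and `bath` had been cast to `int8`, whose maximum of 127 is below the largest raw values (473 and 830).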

### 5. Handle missing values

In [11]:
check_missing(df)

Unnamed: 0,feature,num_missing,percent_missing,num_unique,most_common
0,prev_sold_date,734297,32.981627,14954,2022-03-31
1,house_size,568484,25.533983,12061,1200.0
2,bath,511771,22.986666,86,2.0
3,bed,481317,21.618797,99,3.0
4,acre_lot,325589,14.62413,16057,0.17
5,street,10866,0.488056,2001358,1916862.0
6,brokered_by,4533,0.203604,110143,22611.0
7,price,1541,0.069215,102137,350000.0
8,city,1407,0.063197,20098,Houston
9,zip_code,299,0.01343,30334,33993.0


**Drop Some Columns**

- `prev_sold_date`: irrelevant column with about 33% missing values.
- `street`: encoded categorical column with no predictive benefit.
- `brokered_by`: encoded categorical column with no predictive benefit.

In [12]:
df.drop(['prev_sold_date', 'street', 'brokered_by'], axis=1, inplace=True)

**Drop rows with `state`, `city`, `price`, or `zip_code` missing**

Only a tiny share of rows is affected.


In [13]:
null_condition = df[['state', 'city', 'price', 'zip_code']].isna().any(axis=1)

missing_percentage = len(df[null_condition]) * 100 / len(df)
print(f"Percentage of missing values: {missing_percentage:.2f}%")

Percentage of missing values: 0.14%


> Dropping these rows removes only 0.14% of the dataset, so the loss is negligible.

In [14]:
df = df[~null_condition]
df.reset_index(drop=True, inplace=True)
print(f"DataFrame shape after removing nulls: {df.shape}")

DataFrame shape after removing nulls: (2223239, 9)


**Fill numerical features with `-1`**

The `-1` placeholder marks a missing value. It is meant to be mapped to readable text when the rows are converted to natural language.

In [15]:
null_condition = df[['bath', 'bed', 'acre_lot', 'house_size']].isna().any(axis=1)

missing_percentage = len(df[null_condition]) * 100 / len(df)
print(f"Percentage of missing values: {missing_percentage:.2f}%")

Percentage of missing values: 38.81%


> Dropping these rows would remove almost 39% of the dataset, so they are kept and filled instead.

In [16]:
# Assign the result back: column-level fillna(inplace=True) is chained
# assignment, which newer pandas versions warn about
num_cols = ['bath', 'bed', 'acre_lot', 'house_size']
df[num_cols] = df[num_cols].fillna(-1)

> For future work, an iterative imputer could replace the `-1` placeholder.
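
As a much simpler stand-in for an iterative imputer, missing numeric values could be filled with the median of rows from the same group (for example the same `state`). The sketch below is only an illustration under that assumption: `fill_with_group_median` is a hypothetical helper, it is not an iterative imputer, and it would have to run on the `NaN` values before they are replaced by `-1`.

```python
import numpy as np
import pandas as pd


def fill_with_group_median(df, group_col, value_cols):
    """Fill NaN in value_cols with the median of rows sharing group_col."""
    out = df.copy()
    for col in value_cols:
        group_median = out.groupby(group_col)[col].transform("median")
        # Fall back to the overall median when a whole group is missing
        out[col] = out[col].fillna(group_median).fillna(out[col].median())
    return out


toy = pd.DataFrame({
    "state": ["A", "A", "A", "B", "B"],
    "bed": [2.0, np.nan, 4.0, np.nan, 5.0],
})
filled = fill_with_group_median(toy, "state", ["bed"])
print(filled)
```

Imputed values would be written into the prompts as if they were observed, so keeping a marker for imputed fields may still be worthwhile.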

In [17]:
check_missing(df)

Dataset has no missing values


0

**Reduce data size of `bath` and `bed`**

In [18]:
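# int8 holds at most 127, but bed goes up to 473 and bath up to 830 in the raw data,
# so an int8 cast would wrap those values into negatives; int16 fits them
df['bath'] = df['bath'].astype('int16')
df['bed'] = df['bed'].astype('int16')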

### 6. Memory size reduction

In [19]:
reduce_mem_usage(df)

Memory usage reduced to 89.05 MB (27.6% reduction)


Unnamed: 0,status,price,bed,bath,acre_lot,city,state,zip_code,house_size
0,for_sale,105000.0,3,2,0.12,Adjuntas,Puerto Rico,601.0,920.0
1,for_sale,80000.0,4,2,0.08,Adjuntas,Puerto Rico,601.0,1527.0
2,for_sale,67000.0,2,1,0.15,Juana Diaz,Puerto Rico,795.0,748.0
3,for_sale,145000.0,4,2,0.10,Ponce,Puerto Rico,731.0,1800.0
4,for_sale,65000.0,6,2,0.05,Mayaguez,Puerto Rico,680.0,-1.0
...,...,...,...,...,...,...,...,...,...
2223234,sold,359900.0,4,2,0.33,Richland,Washington,99354.0,3600.0
2223235,sold,350000.0,3,2,0.10,Richland,Washington,99354.0,1616.0
2223236,sold,440000.0,6,3,0.50,Richland,Washington,99354.0,3200.0
2223237,sold,179900.0,2,1,0.09,Richland,Washington,99354.0,933.0


> `reduce_mem_usage` reports 89.05 MB after a 27.6% reduction. Measured against the 203.8 MB reported when the raw file was loaded, the frame is about 56% smaller.
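
The implementation of `reduce_mem_usage` lives in `src.utils` and is not shown here. A minimal sketch of the usual idea, storing each numeric column in a narrower dtype, might look like the following; `downcast_numeric` is a hypothetical name and the real helper may behave differently.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric columns stored in narrower dtypes."""
    out = df.copy()
    for col in out.columns:
        if is_integer_dtype(out[col]):
            # Picks the smallest integer dtype that holds every value
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif is_float_dtype(out[col]):
            # float32 halves the storage but keeps only about 7 significant digits
            out[col] = out[col].astype(np.float32)
    return out


toy = pd.DataFrame({
    "bed": np.array([3, 4, 473], dtype="int64"),
    "acre_lot": [0.12, 0.08, 0.73],
})
small = downcast_numeric(toy)
print(small.dtypes)
print(small.memory_usage(deep=True).sum(), "<", toy.memory_usage(deep=True).sum())
```

The float32 trade-off is visible in the text samples further down, where a land size prints as `0.7300000190734863`; rounding when the value is formatted would hide that artifact.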

### Save sheets for train/val sets

In [20]:
# shuffle data
df_shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
split_index = int(0.99 * len(df_shuffled)) # Define split size

# Split train data to train and validation with 99/1 ratio
train_data = df_shuffled[:split_index]
val_data = df_shuffled[split_index:]

In [21]:
train_data.reset_index(drop=True, inplace=True)
val_data.reset_index(drop=True, inplace=True)

print(f"Train data shape: {train_data.shape}, Val data shape: {val_data.shape}")

Train data shape: (2201006, 9), Val data shape: (22233, 9)


In [22]:
# release memory  
del df_shuffled, df
gc.collect()

0

In [23]:
# Save train and validation data
train_data.to_csv(f'{tabular_data_dir}/train_data.csv', index=False)
val_data.to_csv(f'{tabular_data_dir}/val_data.csv', index=False)

---
## [3] Prepare Data for LLM Fine-Tuning
---
1. Convert the tabular data to natural-language records stored as a JSON list (the files use a `.jsonl` extension, but each one holds a single JSON array).
2. Save the text data for fine-tuning.

### Format data for LLM fine-tuning

In [24]:
from pydantic import BaseModel, Field
from IPython.display import JSON, Markdown

class ResponseSchema(BaseModel):
    estimated_house_price: float = Field(...,
                                description="Numerical value that expresses the estimated house price",
                                example=85000.0)

JSON(ResponseSchema.model_json_schema())

<IPython.core.display.JSON object>

In [25]:
ResponseSchema(estimated_house_price=2510010.0).model_dump_json()

'{"estimated_house_price":2510010.0}'

In [26]:
def return_val(row, col):
    """Return row[col], or 'missing info' when it holds the -1 placeholder."""
    val = row[col]
    if val == -1:
        return 'missing info'
    return val

In [28]:
def translate_data(row, idx):

    description = "\n".join([   
        "A house listing in the USA with the following details:\n" ,
        f"- Status: {row['status']}\n",
        f"- Number of bedrooms: {row['bed']}\n",
        f"- Number of bathrooms: {row['bath']}\n",
        f"- Land size: {row['acre_lot']} acres\n",
        f"- Address (city, state, zip): {row['city']}, {row['state']}, {row['zip_code']}\n"
        f"- House size: {row['house_size']} sqft\n",
        "Your task is to predict the final sale price in $?",
        "### Output schema:",
        f"{ResponseSchema.model_json_schema()}",
        "### Response: \n ```json"
    ])
        
    return {
        "id": idx,
        "query": description,
        "response": ResponseSchema(estimated_house_price=row['price']).model_dump_json()
    }
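
In `translate_data` as written, `return_val` is never called, so the `-1` placeholder reaches the prompt unchanged (the samples below read `Number of bedrooms: -1`). The list items also end in `\n` while being joined with `"\n"`, and the address and house-size strings are merged by implicit concatenation because no comma separates them. The following is a minimal sketch of a variant that addresses these points. It is not the format used for the published dataset, and `build_query`, `show`, `MISSING`, and `FENCE` are hypothetical names.

```python
import json

MISSING = -1  # placeholder written by the fillna step
FENCE = "`" * 3  # three backticks, matching the end of the original prompt


def show(value, unit=""):
    """Render a field, mapping the -1 placeholder to readable text."""
    if value == MISSING:
        return "missing info"
    return f"{value} {unit}".strip()


def build_query(row, schema):
    """Build the prompt with one line per list item."""
    lines = [
        "A house listing in the USA with the following details:",
        f"- Status: {row['status']}",
        f"- Number of bedrooms: {show(row['bed'])}",
        f"- Number of bathrooms: {show(row['bath'])}",
        f"- Land size: {show(row['acre_lot'], 'acres')}",
        f"- Address (city, state, zip): {row['city']}, {row['state']}, {row['zip_code']}",
        f"- House size: {show(row['house_size'], 'sqft')}",
        "Your task is to predict the final sale price in $?",
        "### Output schema:",
        json.dumps(schema),  # valid JSON rather than a Python dict repr
        f"### Response: \n {FENCE}json",
    ]
    return "\n".join(lines)


row = {"status": "for_sale", "bed": -1, "bath": -1, "acre_lot": 0.73,
       "city": "Port Aransas", "state": "Texas", "zip_code": "78373",
       "house_size": -1.0}
print(build_query(row, {"type": "object"}))
```

Any change to the prompt layout has to be applied identically at inference time, since a fine-tuned model is sensitive to the exact template it was trained on.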


In [30]:
from tqdm import tqdm
from src.utils import timeit

@timeit
def translate_all_rows(df):
    text_data = []
    bar_format = '{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}]'
    for idx, row in tqdm(df.iterrows(),
                         total=len(df), unit="sample",
                         ncols=100, colour='green',
                         desc="Translating tabular data to text",
                         bar_format=bar_format):
        
        translated_text = translate_data(row, idx)
        text_data.append(translated_text)
    return text_data
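
`DataFrame.iterrows()` builds a new `Series` for every row, which is usually the slow part of a loop like this one. A minimal sketch of an alternative, assuming the translation function only needs dictionary-style access such as `row['status']`, is to convert the frame to plain dicts first. `translate_records` is a hypothetical name, and the sketch drops the progress bar for brevity.

```python
import pandas as pd


def translate_records(df, translate):
    """Apply translate(row, idx) to every row, passing rows as plain dicts."""
    records = df.to_dict(orient="records")  # one dict per row
    # enumerate() yields positions, which match the index after reset_index(drop=True)
    return [translate(row, idx) for idx, row in enumerate(records)]


toy = pd.DataFrame({"status": ["for_sale", "sold"], "price": [1.0, 2.0]})
out = translate_records(toy, lambda row, idx: {"id": idx, "price": row["price"]})
print(out)
```

The trade-off is memory: `to_dict` materializes every row at once, so for a frame with millions of rows it may be safer to process it in chunks.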

### Translate train dataset to natural language

In [31]:
text_train_data = translate_all_rows(train_data)

Translating tabular data to text: 100%|██████████████████████████████| 2201006/2201006 [23:11<00:00]



 Data completed in 23.21 minutes.


In [None]:
print("Tabular Data Length", train_data.shape)
print("Text Data Length: ", len(text_train_data))
print("Sample:  \n", text_train_data[0])

Text Data Length:  2201006
Sample:  
 {'id': 0, 'query': "A house listing in the USA with the following details:\n\n- Status: for_sale\n\n- Number of bedrooms: -1\n\n- Number of bathrooms: -1\n\n- Land size: 0.7300000190734863 acres\n\n- Address (city, state, zip): Port Aransas, Texas, 78373.0\n- House size: -1.0 sqft\n\nYour task is to predict the final sale price in $?\n### Output schema:\n{'properties': {'estimated_house_price': {'description': 'Numerical value that expresses the estimated house price', 'example': 85000.0, 'title': 'Estimated House Price', 'type': 'number'}}, 'required': ['estimated_house_price'], 'title': 'ResponseSchema', 'type': 'object'}\n### Response: \n ```json", 'response': '{"estimated_house_price":295000.0}'}


### Translate Validation dataset to natural language

In [32]:
text_val_data = translate_all_rows(val_data)

Translating tabular data to text: 100%|██████████████████████████████████| 22233/22233 [00:11<00:00]


 Data completed in 0.18 minutes.





In [33]:
print("Tabular Data Length", val_data.shape)
print("Text Data Length: ", len(text_val_data))
print("Sample:  \n", text_val_data[0])

Tabular Data Length (22233, 9)
Text Data Length:  22233
Sample:  
 {'id': 0, 'query': "A house listing in the USA with the following details:\n\n- Status: for_sale\n\n- Number of bedrooms: -1\n\n- Number of bathrooms: -1\n\n- Land size: 0.07000000029802322 acres\n\n- Address (city, state, zip): Washington, District of Columbia, 20002.0\n- House size: -1.0 sqft\n\nYour task is to predict the final sale price in $?\n### Output schema:\n{'properties': {'estimated_house_price': {'description': 'Numerical value that expresses the estimated house price', 'example': 85000.0, 'title': 'Estimated House Price', 'type': 'number'}}, 'required': ['estimated_house_price'], 'title': 'ResponseSchema', 'type': 'object'}\n### Response: \n ```json", 'response': '{"estimated_house_price":2500000.0}'}


In [34]:
# Release un-needed variables from memory
del train_data, val_data 
gc.collect()

4419

---
## [4] Upload data to Hugging Face Hub
---

In [3]:
from datetime import datetime

username = "heba1998"
data_title = "Real Estate Data For LLM Fine-Tuning"
repo_name = data_title.replace(" ", "-").lower()
date = datetime.now().strftime("%Y-%m-%d")

metadata = {
    "title": data_title,
    "id": f"{username}/{repo_name}",
    "licenses": [{"name": "CC0-1.0"}],
    "description": "Translated Text data generated from tabular US real estate data for LLM fine-tuning",
    "version": "1.0",
    "created_at": date,
    "tags": [
        "LLM",
        "Text Data",
        "Real Estate"
    ],
}

In [36]:
import json

with open(f'{text_data_dir}/text_train_data.jsonl', 'w') as json_file:
    json.dump(text_train_data, json_file, indent=4)

with open(f'{text_data_dir}/text_val_data.jsonl', 'w') as json_file:
    json.dump(text_val_data, json_file, indent=4)
    
with open(f'{text_data_dir}/dataset-metadata.json', 'w') as json_file:
    json.dump(metadata, json_file, indent=4)
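
The cell above writes each dataset as one indented JSON array, even though the file names end in `.jsonl`. In the JSON Lines convention every line holds one standalone JSON object, which lets readers stream a large file record by record. A minimal sketch of that layout follows; `write_jsonl` and `read_jsonl` are hypothetical helpers, not part of this notebook.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one compact JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps escapes newlines inside strings, so each record stays on one line
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the records back, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


sample = [
    {"id": 0, "query": "line one\nline two", "response": '{"estimated_house_price":295000.0}'},
    {"id": 1, "query": "another listing", "response": '{"estimated_house_price":85000.0}'},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl")
    write_jsonl(sample, path)
    restored = read_jsonl(path)
print(restored == sample)
```

Switching formats would also mean replacing the `json.load` call used later to reload the validation file with a line-by-line reader like `read_jsonl`.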

> The data is uploaded to the Hugging Face Hub so it can be used with the tiny LLM model from Colab.

**Log in to Hugging Face**

In [None]:
import huggingface_hub
from huggingface_hub import HfApi

huggingface_hub.login(os.getenv("HF_TOKEN"))
api = HfApi()

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


**Reload the validation data, save a 50-record sample, and define repository details**

In [2]:
import json
text_data_dir = "data/text_data"
with open(f'{text_data_dir}/text_val_data.jsonl', 'r') as json_file:
    text_val_data = json.load(json_file)

In [None]:
sample_50 = text_val_data[:50]
with open(f'{text_data_dir}/sample_50.jsonl', 'w') as json_file:
    json.dump(sample_50, json_file, indent=4)

In [15]:
username = "heba1998"
data_title = "Real Estate Data For LLM Fine-Tuning"
repo_name = data_title.replace(" ", "-").lower()

api.upload_file(
    path_or_fileobj=f"{text_data_dir}/sample_50.jsonl",
    repo_id=f"{username}/{repo_name}",
    repo_type="dataset",
    create_pr=True,
    path_in_repo="sample_50.jsonl",
    commit_message="Add sample_50.jsonl file",
    revision="main",
    )

CommitInfo(commit_url='https://huggingface.co/datasets/heba1998/real-estate-data-for-llm-fine-tuning/commit/f5b76d66c88709faf45be0e43ba27808a0f8c03a', commit_message='Add sample_50.jsonl file', commit_description='', oid='f5b76d66c88709faf45be0e43ba27808a0f8c03a', pr_url='https://huggingface.co/datasets/heba1998/real-estate-data-for-llm-fine-tuning/discussions/2', repo_url=RepoUrl('https://huggingface.co/datasets/heba1998/real-estate-data-for-llm-fine-tuning', endpoint='https://huggingface.co', repo_type='dataset', repo_id='heba1998/real-estate-data-for-llm-fine-tuning'), pr_revision='refs/pr/2', pr_num=2)

In [None]:
# Upload dir
username = "heba1998"
api.upload_folder(
    folder_path=text_data_dir,
    repo_id=f"{username}/{repo_name}",
    repo_type="dataset",
    create_pr=True,
)


text_train_data.jsonl:   0%|          | 0.00/1.66G [00:00<?, ?B/s]'(MaxRetryError("HTTPSConnectionPool(host='hf-hub-lfs-us-east-1.s3-accelerate.amazonaws.com', port=443): Max retries exceeded with url: /repos/bf/77/bf775231d95a7a83e0137e6940f880f969999cde9bab280e791fc89b1370e3b9/e748311cae4b68cf9b8bd9813f2e2e1fb97d15ad89d31d273fa17e56e884d72b?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA2JU7TKAQLC2QXPN7%2F20250507%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250507T040539Z&X-Amz-Expires=86400&X-Amz-Signature=98d2829d4c24c6255e44adc0e4cdebe7ac182d3455e47544253b00a358f655e0&X-Amz-SignedHeaders=host&partNumber=1&uploadId=KjrS5ZZXOoDP7C24n9KG_EAbneCKmhxD9kHgrvav3RE35EAAofe6QUTkiEP_QwOXu6GZSCVU2V99RSuyI9b6rJagdTeTTw172mO09MLsmYEs16ooLU2gQOo7kvCsQtPW&x-id=UploadPart (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1010)')))"), '(Requ