# Seminar 3: Data Formats and APIs

March 25, 2025
- Vilém Krejcar

______________________________

## What we are going to learn

- The most popular data formats and how to work with them
- HTTP requests
- How to scrape websites

## 1. Data formats

### Data Serialization and Formats
Serialization is the process of converting data structures or objects into a format that can be stored or transmitted and later reconstructed (deserialization). Formats commonly used from Python include:
- **JSON**: Lightweight, text-based, and commonly used by APIs.
- **CSV**: Ideal for tabular data, especially data exchanged with spreadsheets.
- **Excel**: Spreadsheet workbooks (`.xlsx`), familiar to most users; pandas can read and write them.
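
As a minimal illustration of the round trip, the sketch below serializes a dictionary to a JSON string and parses it back. The `record` dictionary is made up for this example. Note that JSON has no tuple type, so the tuple comes back as a list.

```python
import json

# Hypothetical record used only for this illustration
record = {"name": "Alice", "scores": (1, 2), "is_member": True, "nickname": None}

text = json.dumps(record)      # Python object -> JSON string (serialization)
restored = json.loads(text)    # JSON string -> Python object (deserialization)

print(text)
print(restored["scores"])      # the tuple was stored as a JSON array
```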

#### JSON

In [2]:
import json
import pandas as pd
import os
    
data = {
    "name": "Alice",
    "age": 30,
    "is_member": True,
    "hobbies": ["reading", "biking", "coding"]
}

# Define the directory and file path
directory = 'Data'
file_path = os.path.join(directory, 'json_sample.json')


# Create the directory if it doesn't exist
os.makedirs(directory, exist_ok=True)

# Serialize to JSON -> file
with open(file_path, 'w') as f:
    json.dump(data, f)



In [3]:
# Deserialize JSON from file -> Python
with open(file_path, 'r') as f:
    data_loaded = json.load(f)

print("Loaded JSON data:", data_loaded)

Loaded JSON data: {'name': 'Alice', 'age': 30, 'is_member': True, 'hobbies': ['reading', 'biking', 'coding']}


In [5]:
pd.DataFrame(data_loaded)

|   | name  | age | is_member | hobbies |
|---|-------|-----|-----------|---------|
| 0 | Alice | 30  | True      | reading |
| 1 | Alice | 30  | True      | biking  |
| 2 | Alice | 30  | True      | coding  |
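
The three rows appear because pandas repeats the scalar values to match the length of the `hobbies` list. If one row per record is wanted instead, one option is to wrap the dictionary in a list, as in this sketch (the `record` dictionary simply repeats the sample data):

```python
import pandas as pd

record = {
    "name": "Alice",
    "age": 30,
    "is_member": True,
    "hobbies": ["reading", "biking", "coding"],
}

broadcast = pd.DataFrame(record)    # scalars repeated once per hobby
single = pd.DataFrame([record])     # one row; the hobbies cell holds the whole list

print(broadcast.shape)
print(single.shape)
```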


#### CSV

In [6]:
import csv

data = [
    ["Name", "Age", "Occupation"],
    ["Alice", 30, "Engineer"],
    ["Bob", 25, "Designer"],
    ["Charlie", 35, "Teacher"]
]

# Again define the directory and create new file path
directory = 'Data'
file_path = os.path.join(directory, 'csv_sample.csv')

# Create the directory if it doesn't exist
os.makedirs(directory, exist_ok=True)

# Write to CSV
with open(file_path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(data)

In [9]:
# Read CSV
csv_data = []
with open(file_path, 'r', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        csv_data.append(row)  # every value is read back as a string
csv_data

[['Name', 'Age', 'Occupation'],
 ['Alice', '30', 'Engineer'],
 ['Bob', '25', 'Designer'],
 ['Charlie', '35', 'Teacher']]

In [10]:
pd.DataFrame(csv_data)

|   | 0       | 1   | 2          |
|---|---------|-----|------------|
| 0 | Name    | Age | Occupation |
| 1 | Alice   | 30  | Engineer   |
| 2 | Bob     | 25  | Designer   |
| 3 | Charlie | 35  | Teacher    |
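
Two things stand out in this output: the header ended up as an ordinary row, and every value is a string. One possible way around both, sketched here with an in-memory copy of the same sample data, is `csv.DictReader` plus an explicit conversion of the `Age` field:

```python
import csv
import io

# In-memory stand-in for the CSV file, so the sketch needs no file on disk
csv_text = "Name,Age,Occupation\nAlice,30,Engineer\nBob,25,Designer\nCharlie,35,Teacher\n"

reader = csv.DictReader(io.StringIO(csv_text))  # first line becomes the keys
rows = [{**row, "Age": int(row["Age"])} for row in reader]  # convert Age to int

for row in rows:
    print(row)
```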


#### Excel

In [None]:
data = {
    "name": "Thomas",
    "age": 30,
    "is_member": True,
    "hobbies": "reading, biking, coding"
}

# Create a DataFrame from the dictionary.
# Wrapping data in a list makes each key a column with one row.
df = pd.DataFrame([data])

# Define the directory and file path for the Excel file.
directory = 'Data'
file_path = os.path.join(directory, 'excel_sample.xlsx')

# Create the directory if it doesn't exist.
os.makedirs(directory, exist_ok=True)

df


|   | name   | age | is_member | hobbies                 |
|---|--------|-----|-----------|-------------------------|
| 0 | Thomas | 30  | True      | reading, biking, coding |


In [12]:
# Write the DataFrame to an Excel file.
df.to_excel(file_path, index=False)

In [14]:
file_path

'Data/excel_sample.xlsx'

In [13]:
# Read the Excel file back into a DataFrame.
df_loaded = pd.read_excel(file_path)
df_loaded


|   | name   | age | is_member | hobbies                 |
|---|--------|-----|-----------|-------------------------|
| 0 | Thomas | 30  | True      | reading, biking, coding |


## 2. Requests and web scraping

In [15]:
import requests # for making HTTP requests
import pandas as pd 
import time
import re # Regex = Regular Expressions

A few small things from the `time` module and the `%%time` cell magic:

In [16]:
%%time
print("Hello, World!")

Hello, World!
CPU times: user 46 μs, sys: 4 μs, total: 50 μs
Wall time: 53.9 μs


In [17]:
time.sleep(1)

In [18]:
t0 = time.time()
time.sleep(1)
t1 = time.time()
print("Time elapsed: ", t1-t0, " seconds")

Time elapsed:  1.0054819583892822  seconds


In [19]:
%%time 
time.sleep(2)

CPU times: user 1.07 ms, sys: 2.73 ms, total: 3.8 ms
Wall time: 2 s


In [20]:
import random

In [21]:
%%time
r_time = random.uniform(0.5, 1.2)
print("Sleeping for ", r_time, " seconds")
time.sleep(r_time)

Sleeping for  0.8794302997972834  seconds
CPU times: user 1.82 ms, sys: 3.33 ms, total: 5.16 ms
Wall time: 884 ms
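
Outside a notebook the `%%time` magic is not available. Below is a small sketch of the same measurement in plain Python, using `time.perf_counter()` (a clock intended for measuring short durations) together with a randomized pause. The 0.05 to 0.1 second bounds are arbitrary values chosen only to keep the example fast.

```python
import random
import time

delay = random.uniform(0.05, 0.1)   # arbitrary short pause for the demo

start = time.perf_counter()
time.sleep(delay)
elapsed = time.perf_counter() - start

print(f"Asked for {delay:.3f} s, measured {elapsed:.3f} s")
```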


#### Task 1: Requesting API

Let us work with data from sreality.cz, which we can access through its API. We suspect that the API limits the number of requests a client may send, but we have not verified this.

##### 1a. Create a function requesting data from sreality

```python
i = 0  # page number
base_url = 'https://www.sreality.cz/api/cs/v2/estates?category_main_cb=1&category_type_cb=1&locality_region_id=10&per_page60&page={}'.format(i)

r = requests.get(base_url)
d = r.json()
```

0) The function should take these parameters:
    * `category_main_cb` - `{'flat':1, 'house':2, 'land':3 }`
    * `category_type_cb` - `{'sell':1,'rent':2}`
    * `locality_region_id` - use 10 as the default value
    * `page` - the page number
1) Accept string inputs for `category_main_cb` and `category_type_cb` and map them to the numeric codes.
2) Include a `try/except` clause to handle errors.
3) The function should return the JSON data converted to Python types.
4) Do not forget to sleep for at least 0.5 s before each request.

In [26]:
import requests
base_url = 'https://www.sreality.cz/api/cs/v2/estates?category_main_cb=1&category_type_cb=1&locality_region_id=10&per_page60&page=0'
r = requests.get(base_url)
r.json()

{'meta_description': '4033 realit v nabídce prodej bytů Praha. Vyberte si novou nemovitost na sreality.cz s hledáním na mapě a velkými náhledy fotografií nabízených bytů.',
 'result_size': 4033,
 '_embedded': {'estates': [{'labelsReleased': [['in_construction', 'garage'],
     ['train']],
    'has_panorama': 0,
    'labels': ['Ve výstavbě', 'Garáž', 'Vlak 1 min. pěšky'],
    'is_auction': False,
    'labelsAll': [['in_construction',
      'personal',
      'terrace',
      'cellar',
      'elevator',
      'parking_lots',
      'garage'],
     ['vet',
      'small_shop',
      'tavern',
      'candy_shop',
      'playground',
      'theater',
      'movies',
      'atm',
      'drugstore',
      'medic',
      'restaurant',
      'kindergarten',
      'bus_public_transport',
      'train',
      'sports',
      'post_office',
      'school',
      'tram',
      'metro',
      'shop']],
    'seo': {'category_main_cb': 1,
     'category_sub_cb': 8,
     'category_type_cb': 1,
     'local

In [29]:
number = 10
text_to_complete = 'Today, I had {} apples.'.format(number)
text_to_complete2 = f'Today, I had {number} apples.'
print(text_to_complete2)
text_to_complete


Today, I had 10 apples.


'Today, I had 10 apples.'

In [30]:
def request_sreality(page, category_main_str, category_type_str, locality_region_id=10):
    """
    Request data from sreality.cz API
    :param page: page number
    :param category_main_str: category of the property
    :param category_type_str: type of the offer
    :param locality_region_id: region id
    :return json: json response
    """
    category_mains = {'flat':1, 'house':2, 'land':3 }
    category_types = {'sell':1, 'rent':2}
    template_url = 'https://www.sreality.cz/api/cs/v2/estates?category_main_cb={category_main}&category_type_cb={category_type}&locality_region_id={locality_region_id}&per_page60&page={page}'
    request_url = template_url.format(
        category_main=category_mains[category_main_str],
        category_type=category_types[category_type_str],
        locality_region_id=locality_region_id,
        page=page
    )
    # sleep for random time
    r_sleep = random.uniform(0.5, 1.2)
    time.sleep(r_sleep)
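    # make the request inside a try-except block
    try:
        r = requests.get(request_url, timeout=10)
        r.raise_for_status()  # treat HTTP error codes (4xx/5xx) as failures
        return r.json()  # parsed JSON as Python dicts and lists
    except (requests.exceptions.RequestException, ValueError) as e:
        # ValueError covers a response body that is not valid JSON
        print(e)
        return None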
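
An alternative to a hand-written URL template is to let the standard library assemble the query string. The sketch below only builds the URL and sends nothing. `build_sreality_url` is a hypothetical helper, it covers only the four parameters the task asks for, and it rejects unknown category strings with a `ValueError` instead of a bare `KeyError`.

```python
from urllib.parse import urlencode

API_URL = 'https://www.sreality.cz/api/cs/v2/estates'
CATEGORY_MAINS = {'flat': 1, 'house': 2, 'land': 3}
CATEGORY_TYPES = {'sell': 1, 'rent': 2}


def build_sreality_url(page, category_main_str, category_type_str, locality_region_id=10):
    """Hypothetical helper: build the request URL without sending anything."""
    if category_main_str not in CATEGORY_MAINS:
        raise ValueError(f"unknown category: {category_main_str!r}")
    if category_type_str not in CATEGORY_TYPES:
        raise ValueError(f"unknown offer type: {category_type_str!r}")
    params = {
        'category_main_cb': CATEGORY_MAINS[category_main_str],
        'category_type_cb': CATEGORY_TYPES[category_type_str],
        'locality_region_id': locality_region_id,
        'page': page,
    }
    return f"{API_URL}?{urlencode(params)}"  # urlencode escapes and joins the pairs


print(build_sreality_url(0, 'flat', 'sell'))
```

`requests.get` also accepts such a dictionary directly through its `params` argument, which avoids building the string by hand at all.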

In [34]:
category_mains = {'flat':1, 'house':2, 'land':3 }

category_mains['flat']

1

In [35]:
d = request_sreality(0, 'flat', 'sell', 10)
d

{'meta_description': '4031 realit v nabídce prodej bytů Praha. Vyberte si novou nemovitost na sreality.cz s hledáním na mapě a velkými náhledy fotografií nabízených bytů.',
 'result_size': 4031,
 '_embedded': {'estates': [{'labelsReleased': [['in_construction'], ['train']],
    'has_panorama': 0,
    'labels': ['Ve výstavbě', 'Vlak 1 min. pěšky'],
    'is_auction': False,
    'labelsAll': [['in_construction',
      'personal',
      'terrace',
      'cellar',
      'elevator',
      'parking_lots',
      'garage'],
     ['vet',
      'small_shop',
      'tavern',
      'candy_shop',
      'playground',
      'theater',
      'movies',
      'atm',
      'drugstore',
      'medic',
      'restaurant',
      'kindergarten',
      'bus_public_transport',
      'train',
      'sports',
      'post_office',
      'school',
      'tram',
      'metro',
      'shop']],
    'seo': {'category_main_cb': 1,
     'category_sub_cb': 8,
     'category_type_cb': 1,
     'locality': 'praha-uhrineves-'

Inspect the element `d`:

In [38]:
d['_embedded'].keys()

dict_keys(['estates', 'is_saved', 'not_precise_location_count'])
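
Chained indexing such as `d['_embedded']['estates']` raises an exception as soon as one level is missing. Here is a small sketch of a more forgiving lookup. `dig` is a hypothetical helper, and `sample` is a hand-made dictionary that only imitates the shape of the response.

```python
def dig(data, *keys, default=None):
    """Follow dict keys or list indexes step by step; return default on a miss."""
    current = data
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Hand-made stand-in for the API response, not real data
sample = {
    'result_size': 2,
    '_embedded': {'estates': [{'name': 'Prodej bytu 2+kk 48 m²', 'price': 6935000}]},
}

print(dig(sample, '_embedded', 'estates', 0, 'price'))
print(dig(sample, '_embedded', 'estates', 5, 'price', default='n/a'))
```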

In [29]:
d['meta_description']

'4034 realit v nabídce prodej bytů Praha. Vyberte si novou nemovitost na sreality.cz s hledáním na mapě a velkými náhledy fotografií nabízených bytů.'

In [30]:
d['result_size']

4034

In [41]:
d['_embedded']['estates'][0]

{'labelsReleased': [['in_construction'], ['train']],
 'has_panorama': 0,
 'labels': ['Ve výstavbě', 'Vlak 1 min. pěšky'],
 'is_auction': False,
 'labelsAll': [['in_construction',
   'personal',
   'terrace',
   'cellar',
   'elevator',
   'parking_lots',
   'garage'],
  ['vet',
   'small_shop',
   'tavern',
   'candy_shop',
   'playground',
   'theater',
   'movies',
   'atm',
   'drugstore',
   'medic',
   'restaurant',
   'kindergarten',
   'bus_public_transport',
   'train',
   'sports',
   'post_office',
   'school',
   'tram',
   'metro',
   'shop']],
 'seo': {'category_main_cb': 1,
  'category_sub_cb': 8,
  'category_type_cb': 1,
  'locality': 'praha-uhrineves-'},
 'exclusively_at_rk': 1,
 'category': 1,
 'has_floor_plan': 1,
 '_embedded': {'favourite': {'is_favourite': False,
   '_links': {'self': {'profile': '/favourite/doc',
     'href': '/cs/v2/favourite/181654092',
     'title': 'Oblibene inzeraty'}}},
  'note': {'note': '',
   '_links': {'self': {'profile': '/note/doc',
   

In [32]:
d['_embedded']['is_saved']

{'email_notification': False,
 'notification_advert_count': 0,
 'stack_id': 0,
 'push_notification': False,
 '_links': {'self': {'href': '/cs/v2/stacks'}},
 'removed': False,
 'saved': False}

In [33]:
d['_embedded']['estates'][0] # first estate - unlisted

{'labelsReleased': [['in_construction', 'terrace', 'cellar'], ['train']],
 'has_panorama': 0,
 'labels': ['Ve výstavbě', 'Terasa', 'Sklep', 'Vlak 1 min. pěšky'],
 'is_auction': False,
 'labelsAll': [['in_construction',
   'personal',
   'terrace',
   'cellar',
   'elevator',
   'parking_lots',
   'garage'],
  ['vet',
   'small_shop',
   'tavern',
   'candy_shop',
   'playground',
   'theater',
   'movies',
   'atm',
   'drugstore',
   'medic',
   'restaurant',
   'kindergarten',
   'bus_public_transport',
   'train',
   'sports',
   'post_office',
   'school',
   'tram',
   'metro',
   'shop']],
 'seo': {'category_main_cb': 1,
  'category_sub_cb': 8,
  'category_type_cb': 1,
  'locality': 'praha-uhrineves-'},
 'exclusively_at_rk': 1,
 'category': 1,
 'has_floor_plan': 1,
 '_embedded': {'favourite': {'is_favourite': False,
   '_links': {'self': {'profile': '/favourite/doc',
     'href': '/cs/v2/favourite/181654092',
     'title': 'Oblibene inzeraty'}}},
  'note': {'note': '',
   '_links

##### 1b. Create a function converting sreality JSON data into a pandas DataFrame

In [42]:
len(d['_embedded']['estates']) # number of estates

21

In [15]:
d['_embedded']['estates'][4]

{'labelsReleased': [[], []],
 'has_panorama': 0,
 'labels': [],
 'is_auction': False,
 'labelsAll': [['personal', 'brick'],
  ['tavern',
   'theater',
   'small_shop',
   'candy_shop',
   'vet',
   'movies',
   'playground',
   'sightseeing',
   'drugstore',
   'bus_public_transport',
   'school',
   'tram',
   'shop',
   'atm',
   'restaurant',
   'kindergarten',
   'medic',
   'post_office',
   'metro',
   'sports',
   'train']],
 'seo': {'category_main_cb': 1,
  'category_sub_cb': 3,
  'category_type_cb': 1,
  'locality': 'praha-zizkov-sladkovskeho-namesti'},
 'exclusively_at_rk': 1,
 'category': 1,
 'has_floor_plan': 1,
 '_embedded': {'favourite': {'is_favourite': False,
   '_links': {'self': {'profile': '/favourite/doc',
     'href': '/cs/v2/favourite/144118348',
     'title': 'Oblibene inzeraty'}}},
  'note': {'note': '',
   '_links': {'self': {'profile': '/note/doc',
     'href': '/cs/v2/note/144118348',
     'title': 'Poznamky inzeratu'}},
   'has_note': False},
  'company': {'

In [43]:
def convert_sreality_data_to_df(sreality_data):
    return pd.DataFrame(sreality_data['_embedded']['estates'])

raw = convert_sreality_data_to_df(d)

In [44]:
raw.head()

Unnamed: 0,labelsReleased,has_panorama,labels,is_auction,labelsAll,seo,exclusively_at_rk,category,has_floor_plan,_embedded,...,hash_id,attractive_offer,price,price_czk,_links,rus,name,region_tip,gps,has_matterport_url
0,"[[in_construction], [train]]",0,"[Ve výstavbě, Vlak 1 min. pěšky]",False,"[[in_construction, personal, terrace, cellar, ...","{'category_main_cb': 1, 'category_sub_cb': 8, ...",1,1,1,"{'favourite': {'is_favourite': False, '_links'...",...,181654092,0,16298000,"{'value_raw': 16298000, 'unit': '', 'name': 'C...",{'dynamicDown': [{'href': 'https://d18-a.sdn.c...,False,Prodej bytu 4+kk 112 m²,3191081,"{'lat': 50.021845912870795, 'lon': 14.60306408...",False
1,"[[after_reconstruction, panel, partly_furnishe...",0,"[Po rekonstrukci, Panelová, Částečně vybavený,...",False,"[[personal, after_reconstruction, panel, cella...","{'category_main_cb': 1, 'category_sub_cb': 3, ...",0,1,0,"{'favourite': {'is_favourite': False, '_links'...",...,3571479116,0,5770000,"{'value_raw': 5770000, 'unit': '', 'name': 'Ce...",{'dynamicDown': [{'href': 'https://d18-a.sdn.c...,False,Prodej bytu 1+1 35 m²,0,"{'lat': 50.01731091287079, 'lon': 14.527898087...",False
2,"[[], [post_office, drugstore]]",0,"[Pošta 472 metrů pěšky, Lékárna 409 metrů pěšky]",False,"[[personal, brick, cellar, elevator], [sports,...","{'category_main_cb': 1, 'category_sub_cb': 4, ...",0,1,0,"{'favourite': {'is_favourite': False, '_links'...",...,892981836,0,9787000,"{'value_raw': 9787000, 'unit': '', 'name': 'Ce...",{'dynamicDown': [{'href': 'https://d18-a.sdn.c...,False,Prodej bytu 2+kk 76 m²,0,"{'lat': 50.07601891287079, 'lon': 14.481808087...",False
3,"[[after_reconstruction, furnished], [metro, dr...",0,"[Po rekonstrukci, Vybavený, Metro 4 min. pěšky...",False,"[[personal, after_reconstruction, brick, cella...","{'category_main_cb': 1, 'category_sub_cb': 8, ...",1,1,0,"{'favourite': {'is_favourite': False, '_links'...",...,2011771468,0,0,"{'value_raw': 0, 'unit': '', 'name': 'Celková ...",{'dynamicDown': [{'href': 'https://d18-a.sdn.c...,False,Prodej bytu 4+kk 165 m²,0,"{'lat': 50.07317691287079, 'lon': 14.444686087...",False
4,"[[], []]",0,[],False,"[[new_building, personal, terrace, cellar, ele...","{'category_main_cb': 1, 'category_sub_cb': 8, ...",1,1,0,"{'favourite': {'is_favourite': False, '_links'...",...,3553169996,0,25212000,"{'value_raw': 25212000, 'unit': '', 'name': 'C...",{'dynamicDown': [{'href': 'https://d18-a.sdn.c...,False,Prodej bytu 4+kk 134 m²,0,"{'lat': 50.022070912870795, 'lon': 14.51959808...",False


##### 1c. Chain function `1b` onto function `1a`

In [80]:
df = convert_sreality_data_to_df(request_sreality(0, 'flat', 'sell', 10))
df.head()

Unnamed: 0,labelsReleased,has_panorama,labels,is_auction,labelsAll,seo,exclusively_at_rk,category,has_floor_plan,_embedded,...,hash_id,attractive_offer,price,price_czk,_links,rus,name,region_tip,gps,has_matterport_url
0,"[[in_construction, garage], [train]]",0,"[Ve výstavbě, Garáž, Vlak 1 min. pěšky]",False,"[[in_construction, personal, terrace, cellar, ...","{'category_main_cb': 1, 'category_sub_cb': 8, ...",1,1,1,"{'favourite': {'is_favourite': False, '_links'...",...,181654092,0,14518000,"{'value_raw': 14518000, 'unit': '', 'name': 'C...",{'dynamicDown': [{'href': 'https://d18-a.sdn.c...,False,Prodej bytu 4+kk 100 m²,3191081,"{'lat': 50.02321621360754, 'lon': 14.601693786...",False
1,"[[partly_furnished], [post_office]]",0,"[Částečně vybavený, Pošta 6 min. pěšky]",False,"[[personal, balcony, brick, cellar, elevator, ...","{'category_main_cb': 1, 'category_sub_cb': 6, ...",0,1,1,"{'favourite': {'is_favourite': False, '_links'...",...,131142988,0,31293000,"{'value_raw': 31293000, 'unit': '', 'name': 'C...",{'dynamicDown': [{'href': 'https://d18-a.sdn.c...,False,Prodej bytu 3+kk 160 m² (Podkrovní),0,"{'lat': 50.07438321360754, 'lon': 14.417012786...",False
2,"[[partly_furnished], []]",0,[Částečně vybavený],False,"[[personal, brick, partly_furnished], [tavern,...","{'category_main_cb': 1, 'category_sub_cb': 6, ...",1,1,0,"{'favourite': {'is_favourite': False, '_links'...",...,288179532,0,10057000,"{'value_raw': 10057000, 'unit': '', 'name': 'C...",{'dynamicDown': [{'href': 'https://d18-a.sdn.c...,False,Prodej bytu 3+kk 96 m² (Podkrovní),0,"{'lat': 50.05343621360754, 'lon': 14.399236786...",False
3,"[[panel, furnished], [medic, drugstore]]",0,"[Panelová, Vybavený, Lékař 7 min. pěšky, Lékár...",False,"[[personal, panel, cellar, elevator, furnished...","{'category_main_cb': 1, 'category_sub_cb': 4, ...",1,1,0,"{'favourite': {'is_favourite': False, '_links'...",...,3775754828,0,6935000,"{'value_raw': 6935000, 'unit': '', 'name': 'Ce...",{'dynamicDown': [{'href': 'https://d18-a.sdn.c...,False,Prodej bytu 2+kk 48 m²,0,"{'lat': 50.042065213607536, 'lon': 14.35449678...",False
4,"[[], [train]]",0,[Vlak 6 min. pěšky],False,"[[personal, brick, cellar], [small_shop, movie...","{'category_main_cb': 1, 'category_sub_cb': 4, ...",1,1,0,"{'favourite': {'is_favourite': False, '_links'...",...,1989653068,0,6583000,"{'value_raw': 6583000, 'unit': '', 'name': 'Ce...",{'dynamicDown': [{'href': 'https://d18-a.sdn.c...,False,Prodej bytu 2+kk 45 m²,0,"{'lat': 50.05146021360754, 'lon': 14.458017786...",False


In [81]:
df.shape

(21, 27)

##### 1d. Combine multiple requests into a single DataFrame

* The function should take:
    * `start_page` and `end_page`
    * the request parameters
* Construct a list of the DataFrames from the individual requests
* Then pass the list to `pd.concat`

In [82]:
raw.shape

(21, 27)

In [83]:
request_sreality

<function __main__.request_sreality(page, category_main_str, category_type_str, locality_region_id=10)>

In [84]:
def request_multiply_sreality(start_page, end_page, category_main_str, category_type_str, locality_region_id=10):
    pages_to_request = range(start_page, end_page+1)
    list_of_dfs = []
    for page in pages_to_request:
        json_data = request_sreality(page, category_main_str, category_type_str, locality_region_id)
        if json_data is None:
            continue
        df = convert_sreality_data_to_df(json_data)
        list_of_dfs.append(df)
    
    return pd.concat(list_of_dfs)

df = request_multiply_sreality(1, 5, 'flat', 'sell', 10)
df.shape

(105, 27)
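
Two details of `pd.concat` are worth knowing at this step: by default each page keeps its own row labels, and an empty list raises a `ValueError`. The sketch below, with a hypothetical `concat_pages` helper and made-up frames, shows one way to handle both.

```python
import pandas as pd


def concat_pages(frames):
    """Hypothetical helper: join page frames and renumber rows from 0."""
    if not frames:                 # every request failed, nothing to join
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Made-up stand-ins for two downloaded pages
page_a = pd.DataFrame({'hash_id': [1, 2]})
page_b = pd.DataFrame({'hash_id': [3]})

combined = concat_pages([page_a, page_b])
print(combined.index.tolist())
print(concat_pages([]).empty)
```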

In [86]:
df.head()

Unnamed: 0,labelsReleased,has_panorama,labels,is_auction,labelsAll,seo,exclusively_at_rk,category,has_floor_plan,_embedded,...,hash_id,attractive_offer,price,price_czk,_links,rus,name,region_tip,gps,has_matterport_url
0,"[[in_construction, garage], [train]]",0,"[Ve výstavbě, Garáž, Vlak 1 min. pěšky]",False,"[[in_construction, personal, terrace, cellar, ...","{'category_main_cb': 1, 'category_sub_cb': 8, ...",1,1,1,"{'favourite': {'is_favourite': False, '_links'...",...,181654092,0,14518000,"{'value_raw': 14518000, 'unit': '', 'name': 'C...",{'dynamicDown': [{'href': 'https://d18-a.sdn.c...,False,Prodej bytu 4+kk 100 m²,3191081,"{'lat': 50.02321621360754, 'lon': 14.601693786...",False
1,"[[partly_furnished], [post_office]]",0,"[Částečně vybavený, Pošta 6 min. pěšky]",False,"[[personal, balcony, brick, cellar, elevator, ...","{'category_main_cb': 1, 'category_sub_cb': 6, ...",0,1,1,"{'favourite': {'is_favourite': False, '_links'...",...,131142988,0,31293000,"{'value_raw': 31293000, 'unit': '', 'name': 'C...",{'dynamicDown': [{'href': 'https://d18-a.sdn.c...,False,Prodej bytu 3+kk 160 m² (Podkrovní),0,"{'lat': 50.07438321360754, 'lon': 14.417012786...",False
2,"[[partly_furnished], []]",0,[Částečně vybavený],False,"[[personal, brick, partly_furnished], [tavern,...","{'category_main_cb': 1, 'category_sub_cb': 6, ...",1,1,0,"{'favourite': {'is_favourite': False, '_links'...",...,288179532,0,10057000,"{'value_raw': 10057000, 'unit': '', 'name': 'C...",{'dynamicDown': [{'href': 'https://d18-a.sdn.c...,False,Prodej bytu 3+kk 96 m² (Podkrovní),0,"{'lat': 50.05343621360754, 'lon': 14.399236786...",False
3,"[[panel, furnished], [medic, drugstore]]",0,"[Panelová, Vybavený, Lékař 7 min. pěšky, Lékár...",False,"[[personal, panel, cellar, elevator, furnished...","{'category_main_cb': 1, 'category_sub_cb': 4, ...",1,1,0,"{'favourite': {'is_favourite': False, '_links'...",...,3775754828,0,6935000,"{'value_raw': 6935000, 'unit': '', 'name': 'Ce...",{'dynamicDown': [{'href': 'https://d18-a.sdn.c...,False,Prodej bytu 2+kk 48 m²,0,"{'lat': 50.042065213607536, 'lon': 14.35449678...",False
4,"[[], [train]]",0,[Vlak 6 min. pěšky],False,"[[personal, brick, cellar], [small_shop, movie...","{'category_main_cb': 1, 'category_sub_cb': 4, ...",1,1,0,"{'favourite': {'is_favourite': False, '_links'...",...,1989653068,0,6583000,"{'value_raw': 6583000, 'unit': '', 'name': 'Ce...",{'dynamicDown': [{'href': 'https://d18-a.sdn.c...,False,Prodej bytu 2+kk 45 m²,0,"{'lat': 50.05146021360754, 'lon': 14.458017786...",False


In [88]:
df.reset_index(drop=True, inplace=True)

In [89]:
df.index

RangeIndex(start=0, stop=105, step=1)

#### Task 2: Cleaning data

##### 2a. Filter columns
* Keep only the columns `['locality', 'price', 'name', 'gps','hash_id','exclusively_at_rk']`
* Use `.copy()` to avoid `SettingWithCopyWarning` later


In [91]:
df.columns

Index(['labelsReleased', 'has_panorama', 'labels', 'is_auction', 'labelsAll',
       'seo', 'exclusively_at_rk', 'category', 'has_floor_plan', '_embedded',
       'paid_logo', 'locality', 'has_video', 'advert_images_count', 'new',
       'auctionPrice', 'type', 'hash_id', 'attractive_offer', 'price',
       'price_czk', '_links', 'rus', 'name', 'region_tip', 'gps',
       'has_matterport_url'],
      dtype='object')

Copy the selection into a new DataFrame:

In [92]:
df_clean = df[['locality', 'price', 'name', 'gps', 'hash_id', 'exclusively_at_rk']].copy()
df_clean

Unnamed: 0,locality,price,name,gps,hash_id,exclusively_at_rk
0,Praha 10 - Uhříněves,14518000,Prodej bytu 4+kk 100 m²,"{'lat': 50.02321621360754, 'lon': 14.601693786...",181654092,1
1,Praha 1 - Malá Strana,31293000,Prodej bytu 3+kk 160 m² (Podkrovní),"{'lat': 50.07438321360754, 'lon': 14.417012786...",131142988,0
2,Praha 5 - Smíchov,10057000,Prodej bytu 3+kk 96 m² (Podkrovní),"{'lat': 50.05343621360754, 'lon': 14.399236786...",288179532,1
3,Praha 5 - Stodůlky,6935000,Prodej bytu 2+kk 48 m²,"{'lat': 50.042065213607536, 'lon': 14.35449678...",3775754828,1
4,Praha 4 - Nusle,6583000,Prodej bytu 2+kk 45 m²,"{'lat': 50.05146021360754, 'lon': 14.458017786...",1989653068,1
...,...,...,...,...,...,...
100,Praha 5 - Smíchov,8594000,Prodej bytu 2+kk 57 m²,"{'lat': 50.05944121360754, 'lon': 14.418958786...",2995417676,0
101,Praha 9 - Vysočany,6694000,Prodej bytu 2+kk 59 m²,"{'lat': 50.08976821360754, 'lon': 14.501791786...",2848944716,0
102,Praha 9 - Vysočany,5197000,Prodej bytu 1+kk 34 m²,"{'lat': 50.08976821360754, 'lon': 14.501791786...",3402592844,0
103,Praha 9 - Vysočany,13356000,Prodej bytu 4+kk 137 m²,"{'lat': 50.08976821360754, 'lon': 14.501791786...",181367372,0


##### 2b. GPS
* Convert the dictionary in the `gps` column into two columns, `lat` and `lon`
* Use the `apply` function on the `gps` column
* Note that `apply` can return multiple columns when the function returns a `pd.Series`

In [98]:
df_clean.loc[1:2, 'gps']

1    {'lat': 50.07438321360754, 'lon': 14.417012786...
2    {'lat': 50.05343621360754, 'lon': 14.399236786...
Name: gps, dtype: object

In [99]:
df_clean[['lat', 'lon']] = df_clean.gps.apply(lambda x: pd.Series({'lat': x['lat'], 'lon': x['lon']}))

In [100]:
df_clean.head()

Unnamed: 0,locality,price,name,gps,hash_id,exclusively_at_rk,lat,lon
0,Praha 10 - Uhříněves,14518000,Prodej bytu 4+kk 100 m²,"{'lat': 50.02321621360754, 'lon': 14.601693786...",181654092,1,50.023216,14.601694
1,Praha 1 - Malá Strana,31293000,Prodej bytu 3+kk 160 m² (Podkrovní),"{'lat': 50.07438321360754, 'lon': 14.417012786...",131142988,0,50.074383,14.417013
2,Praha 5 - Smíchov,10057000,Prodej bytu 3+kk 96 m² (Podkrovní),"{'lat': 50.05343621360754, 'lon': 14.399236786...",288179532,1,50.053436,14.399237
3,Praha 5 - Stodůlky,6935000,Prodej bytu 2+kk 48 m²,"{'lat': 50.042065213607536, 'lon': 14.35449678...",3775754828,1,50.042065,14.354497
4,Praha 4 - Nusle,6583000,Prodej bytu 2+kk 45 m²,"{'lat': 50.05146021360754, 'lon': 14.458017786...",1989653068,1,50.05146,14.458018
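
Building a `pd.Series` for every row works but can be slow on large frames. A possible alternative, sketched on a made-up two-row frame that imitates the `gps` column, is to turn the column into a list of dictionaries and let the DataFrame constructor expand it:

```python
import pandas as pd

# Hypothetical frame imitating the structure of the `gps` column
listings = pd.DataFrame({
    'hash_id': [1, 2],
    'gps': [{'lat': 50.07, 'lon': 14.42}, {'lat': 50.05, 'lon': 14.40}],
})

# One column per dictionary key; reuse the original index so rows stay aligned
coords = pd.DataFrame(listings['gps'].tolist(), index=listings.index)
listings[['lat', 'lon']] = coords[['lat', 'lon']]

print(listings[['hash_id', 'lat', 'lon']])
```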


##### 2c. Get the flat type from the name

* The name usually follows the pattern `Prodej bytu [type of flat] [area] m²`, sometimes with a trailing note such as `(Podkrovní)`
* Try picking the third word of the string
* Check whether the result is meaningful using `.value_counts()`

In [101]:
df_clean.name.apply(lambda x: x.split(' '))[:5]

0                 [Prodej, bytu, 4+kk, 100 m²]
1    [Prodej, bytu, 3+kk, 160 m², (Podkrovní)]
2     [Prodej, bytu, 3+kk, 96 m², (Podkrovní)]
3                  [Prodej, bytu, 2+kk, 48 m²]
4                  [Prodej, bytu, 2+kk, 45 m²]
Name: name, dtype: object

In [102]:
df_clean.name.apply(lambda x: x.split())[:5]

0                 [Prodej, bytu, 4+kk, 100, m²]
1    [Prodej, bytu, 3+kk, 160, m², (Podkrovní)]
2     [Prodej, bytu, 3+kk, 96, m², (Podkrovní)]
3                  [Prodej, bytu, 2+kk, 48, m²]
4                  [Prodej, bytu, 2+kk, 45, m²]
Name: name, dtype: object

In [103]:
df_clean['type_of_flat'] = df_clean.name.apply(lambda x: x.split(' ')[2])

In [104]:
df_clean['flat_type'] = df_clean.name.apply(lambda x: x.split()[2])
df_clean.head()

Unnamed: 0,locality,price,name,gps,hash_id,exclusively_at_rk,lat,lon,type_of_flat,flat_type
0,Praha 10 - Uhříněves,14518000,Prodej bytu 4+kk 100 m²,"{'lat': 50.02321621360754, 'lon': 14.601693786...",181654092,1,50.023216,14.601694,4+kk,4+kk
1,Praha 1 - Malá Strana,31293000,Prodej bytu 3+kk 160 m² (Podkrovní),"{'lat': 50.07438321360754, 'lon': 14.417012786...",131142988,0,50.074383,14.417013,3+kk,3+kk
2,Praha 5 - Smíchov,10057000,Prodej bytu 3+kk 96 m² (Podkrovní),"{'lat': 50.05343621360754, 'lon': 14.399236786...",288179532,1,50.053436,14.399237,3+kk,3+kk
3,Praha 5 - Stodůlky,6935000,Prodej bytu 2+kk 48 m²,"{'lat': 50.042065213607536, 'lon': 14.35449678...",3775754828,1,50.042065,14.354497,2+kk,2+kk
4,Praha 4 - Nusle,6583000,Prodej bytu 2+kk 45 m²,"{'lat': 50.05146021360754, 'lon': 14.458017786...",1989653068,1,50.05146,14.458018,2+kk,2+kk


In [105]:
df_clean['type_of_flat'].value_counts()

type_of_flat
2+kk          47
3+kk          23
1+kk          14
4+kk          13
atypické       2
2+1 101 m²     1
1+1 31 m²      1
2+1 86 m²      1
5+kk           1
2+1 83 m²      1
3+1 108 m²     1
Name: count, dtype: int64

In [106]:
df_clean['flat_type'].value_counts()

flat_type
2+kk        47
3+kk        23
1+kk        14
4+kk        13
2+1          3
atypické     2
1+1          1
5+kk         1
3+1          1
Name: count, dtype: int64
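
The two columns differ because `split(' ')` cuts only at the ordinary space character, while `split()` without arguments cuts at any whitespace. The output is consistent with the names containing non-breaking spaces (`\xa0`) around the area, but that is an assumption here, not something checked against the raw data. A sketch with a hand-written string:

```python
# Hand-written name; the \xa0 characters are assumed, not taken from the API
name = 'Prodej bytu 2+1\xa0101\xa0m²'

by_space = name.split(' ')   # splits on the plain space only
by_any = name.split()        # splits on any whitespace, including \xa0

print(by_space[2])
print(by_any[2])
```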

In [107]:
# Regular expression: capture the digits right before 'm²' (the result is a string)
df_clean['area1'] = df_clean['name'].apply(lambda x: re.findall(r'(\d+)\s+m²', x)[0])

In [108]:
df_clean['area1']

0      100
1      160
2       96
3       48
4       45
      ... 
100     57
101     59
102     34
103    137
104     63
Name: area1, Length: 105, dtype: object

In [109]:
# Regular expression capturing the parts before and after '+'
def flat_type(x: str):
    # \d+  one or more digits
    # \s*  zero or more whitespace characters
    # \+   a literal plus sign
    # \w+  one or more word characters
    # findall returns a list of (before, after) tuples; it is empty when nothing matches
    return re.findall(r'(\d+)\s*\+\s*(\w+)', x)

In [110]:
sample_text = 'Prodej bytu 3+kk 71 m²'
print(f"Initial text: {sample_text}")
print("Processed text:")
flat_type(sample_text)

Initial text: Prodej bytu 3+kk 71 m²
Processed text:


[('3', 'kk')]

In [111]:
df_clean['rooms'] = df_clean['name'].apply(flat_type)

In [112]:
df_clean.head() 

Unnamed: 0,locality,price,name,gps,hash_id,exclusively_at_rk,lat,lon,type_of_flat,flat_type,area1,rooms
0,Praha 10 - Uhříněves,14518000,Prodej bytu 4+kk 100 m²,"{'lat': 50.02321621360754, 'lon': 14.601693786...",181654092,1,50.023216,14.601694,4+kk,4+kk,100,"[(4, kk)]"
1,Praha 1 - Malá Strana,31293000,Prodej bytu 3+kk 160 m² (Podkrovní),"{'lat': 50.07438321360754, 'lon': 14.417012786...",131142988,0,50.074383,14.417013,3+kk,3+kk,160,"[(3, kk)]"
2,Praha 5 - Smíchov,10057000,Prodej bytu 3+kk 96 m² (Podkrovní),"{'lat': 50.05343621360754, 'lon': 14.399236786...",288179532,1,50.053436,14.399237,3+kk,3+kk,96,"[(3, kk)]"
3,Praha 5 - Stodůlky,6935000,Prodej bytu 2+kk 48 m²,"{'lat': 50.042065213607536, 'lon': 14.35449678...",3775754828,1,50.042065,14.354497,2+kk,2+kk,48,"[(2, kk)]"
4,Praha 4 - Nusle,6583000,Prodej bytu 2+kk 45 m²,"{'lat': 50.05146021360754, 'lon': 14.458017786...",1989653068,1,50.05146,14.458018,2+kk,2+kk,45,"[(2, kk)]"
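
A list of tuples in every cell is awkward to work with later. One possible refinement, sketched here with a hypothetical `parse_layout` function, uses `re.search` with named groups and returns `None` for names without an `N+something` layout, such as `atypické`:

```python
import re

LAYOUT = re.compile(r'(?P<rooms>\d+)\s*\+\s*(?P<kitchen>\w+)')


def parse_layout(name):
    """Return (rooms, kitchen) from a listing name, or None if there is no match."""
    match = LAYOUT.search(name)
    if match is None:
        return None
    return int(match['rooms']), match['kitchen']


print(parse_layout('Prodej bytu 3+kk 71 m²'))
print(parse_layout('Prodej bytu atypické 95 m²'))
```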


##### 2d. Get the area of a flat from the name

* Naive: select the second-to-last word
* Then try navigating using the index of `'m²'`
* If this also fails, you will need a regular expression (`import re`)

In [113]:
def name_to_area(nm):
    parts = nm.split()          # split on any whitespace
    m_idx = parts.index('m²')   # position of the unit
    return int(parts[m_idx-1])  # the number right before the unit

df_clean['area2'] = df_clean.name.apply(name_to_area)
df_clean.head()

Unnamed: 0,locality,price,name,gps,hash_id,exclusively_at_rk,lat,lon,type_of_flat,flat_type,area1,rooms,area2
0,Praha 10 - Uhříněves,14518000,Prodej bytu 4+kk 100 m²,"{'lat': 50.02321621360754, 'lon': 14.601693786...",181654092,1,50.023216,14.601694,4+kk,4+kk,100,"[(4, kk)]",100
1,Praha 1 - Malá Strana,31293000,Prodej bytu 3+kk 160 m² (Podkrovní),"{'lat': 50.07438321360754, 'lon': 14.417012786...",131142988,0,50.074383,14.417013,3+kk,3+kk,160,"[(3, kk)]",160
2,Praha 5 - Smíchov,10057000,Prodej bytu 3+kk 96 m² (Podkrovní),"{'lat': 50.05343621360754, 'lon': 14.399236786...",288179532,1,50.053436,14.399237,3+kk,3+kk,96,"[(3, kk)]",96
3,Praha 5 - Stodůlky,6935000,Prodej bytu 2+kk 48 m²,"{'lat': 50.042065213607536, 'lon': 14.35449678...",3775754828,1,50.042065,14.354497,2+kk,2+kk,48,"[(2, kk)]",48
4,Praha 4 - Nusle,6583000,Prodej bytu 2+kk 45 m²,"{'lat': 50.05146021360754, 'lon': 14.458017786...",1989653068,1,50.05146,14.458018,2+kk,2+kk,45,"[(2, kk)]",45


In [114]:
df_clean[df_clean["area1"] == df_clean["area2"]]

Unnamed: 0,locality,price,name,gps,hash_id,exclusively_at_rk,lat,lon,type_of_flat,flat_type,area1,rooms,area2


The result is empty, so no pair of values compares as equal. The element types explain why:

In [115]:
type(df_clean['area1'][0]), type(df_clean['area2'][0])

(str, numpy.int64)
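
A string never equals an integer in Python, even when both show the same digits, so the element-wise comparison is `False` everywhere. A tiny sketch with made-up values:

```python
# Made-up values mirroring the two extraction approaches
area_from_regex = ['100', '160', '96']   # re.findall returns strings
area_from_split = [100, 160, 96]         # int(...) was applied

naive = [a == b for a, b in zip(area_from_regex, area_from_split)]
converted = [int(a) == b for a, b in zip(area_from_regex, area_from_split)]

print(naive)
print(converted)
```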

In [116]:
df_clean[df_clean['area1'].astype(int)==df_clean['area2']]

Unnamed: 0,locality,price,name,gps,hash_id,exclusively_at_rk,lat,lon,type_of_flat,flat_type,area1,rooms,area2
0,Praha 10 - Uhříněves,14518000,Prodej bytu 4+kk 100 m²,"{'lat': 50.02321621360754, 'lon': 14.601693786...",181654092,1,50.023216,14.601694,4+kk,4+kk,100,"[(4, kk)]",100
1,Praha 1 - Malá Strana,31293000,Prodej bytu 3+kk 160 m² (Podkrovní),"{'lat': 50.07438321360754, 'lon': 14.417012786...",131142988,0,50.074383,14.417013,3+kk,3+kk,160,"[(3, kk)]",160
2,Praha 5 - Smíchov,10057000,Prodej bytu 3+kk 96 m² (Podkrovní),"{'lat': 50.05343621360754, 'lon': 14.399236786...",288179532,1,50.053436,14.399237,3+kk,3+kk,96,"[(3, kk)]",96
3,Praha 5 - Stodůlky,6935000,Prodej bytu 2+kk 48 m²,"{'lat': 50.042065213607536, 'lon': 14.35449678...",3775754828,1,50.042065,14.354497,2+kk,2+kk,48,"[(2, kk)]",48
4,Praha 4 - Nusle,6583000,Prodej bytu 2+kk 45 m²,"{'lat': 50.05146021360754, 'lon': 14.458017786...",1989653068,1,50.051460,14.458018,2+kk,2+kk,45,"[(2, kk)]",45
...,...,...,...,...,...,...,...,...,...,...,...,...,...
100,Praha 5 - Smíchov,8594000,Prodej bytu 2+kk 57 m²,"{'lat': 50.05944121360754, 'lon': 14.418958786...",2995417676,0,50.059441,14.418959,2+kk,2+kk,57,"[(2, kk)]",57
101,Praha 9 - Vysočany,6694000,Prodej bytu 2+kk 59 m²,"{'lat': 50.08976821360754, 'lon': 14.501791786...",2848944716,0,50.089768,14.501792,2+kk,2+kk,59,"[(2, kk)]",59
102,Praha 9 - Vysočany,5197000,Prodej bytu 1+kk 34 m²,"{'lat': 50.08976821360754, 'lon': 14.501791786...",3402592844,0,50.089768,14.501792,1+kk,1+kk,34,"[(1, kk)]",34
103,Praha 9 - Vysočany,13356000,Prodej bytu 4+kk 137 m²,"{'lat': 50.08976821360754, 'lon': 14.501791786...",181367372,0,50.089768,14.501792,4+kk,4+kk,137,"[(4, kk)]",137
