In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Cleaning data 2

In Part 1, we identified the missing values and cleaned the data so that it no longer contains any. In the list of steps below, we have almost reached the end of step 3.

1. Data collection &ndash; finding and collecting relevant data.
2. Assessing the data &ndash; it is important to know what is in the data, how it was produced, and possibly to assess the quality and reliability of the data.
3. Cleaning and validating data &ndash; filling in gaps, removing invalid data:
   * treating missing values &ndash; removing whole records or inserting some values,
   * removing outliers, and
   * validation &ndash; examine the data for errors.
4. Transformation and enriching &ndash; normalizing the data and possibly adding related information to provide deeper insights.
5. Storing the cleaned data.

Now we will continue with validation and transformation of the data.

# Validating data

Read the table `property_data_clean1.csv`.

In [2]:
df = pd.read_csv('data/property_data_clean1.csv')
df

,Unnamed: 0,PID,ST_NUM,ST_NAME,OWN_OCCUPIED,NUM_BEDROOMS,NUM_BATH,AREA
0,0,100001000.0,4.0,Evropská,Y,3.0,1.0,90
1,1,100002000.0,97.0,Stroměstské náměstí,N,3.0,1.0,112
2,2,100003000.0,125.0,Opatovská,N,2.5,1.0,80
3,3,100004000.0,20.0,Opatovská,Y,1.0,1.0,71
4,4,100000125.0,203.0,Evropská,Y,3.0,2.0,160
5,5,100006000.0,27.0,Přemyslovská,Y,2.5,1.0,112
6,6,100007000.0,1.0,Ruská,Y,2.0,1.0,95
7,7,100008000.0,213.0,Italská,Y,1.0,1.0,112
8,8,100009000.0,56.0,Italská,Y,2.5,2.0,180


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0    9 non-null      int64  
 1   PID           9 non-null      float64
 2   ST_NUM        9 non-null      float64
 3   ST_NAME       9 non-null      object 
 4   OWN_OCCUPIED  9 non-null      object 
 5   NUM_BEDROOMS  9 non-null      float64
 6   NUM_BATH      9 non-null      float64
 7   AREA          9 non-null      int64  
dtypes: float64(4), int64(2), object(2)
memory usage: 704.0+ bytes


We can see that the columns `ST_NAME` and `OWN_OCCUPIED` are of type `object`. This is not suitable for further processing. We should convert the values into a categorical, Boolean, or numeric type.

In [4]:
# convert ST_NAME into categorical type
# YOUR CODE HERE
df['ST_NAME'] = df['ST_NAME'].astype("category")

Simple code for converting the values in the column `OWN_OCCUPIED` does not work. Correct it!

In [5]:
# convert OWN_OCCUPIED into a Boolean
# correct the code below
# YOUR CODE HERE
df["OWN_OCCUPIED"] = df["OWN_OCCUPIED"].map({"Y":True,"N":False})
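One likely reason a plain cast is not enough: `astype(bool)` on a column of Python strings uses truthiness, so every non-empty string, including `"N"`, becomes `True`. The sketch below illustrates this on a made-up series; the names `own`, `naive`, and `mapped` are illustrative and not part of the exercise.

```python
import pandas as pd

# Hypothetical sample values standing in for the OWN_OCCUPIED column;
# dtype=object mirrors what read_csv produced for this column above
own = pd.Series(["Y", "N", "Y"], dtype=object)

# Naive cast: any non-empty string is truthy, so "N" also becomes True
naive = own.astype(bool)

# Explicit mapping: each label is translated to the intended Boolean
mapped = own.map({"Y": True, "N": False})

print(naive.tolist())
print(mapped.tolist())
```

Note that `map` returns a missing value for any label that is not a key of the dictionary, so unexpected labels become visible instead of being silently converted.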

Drop the column `Unnamed: 0`, which does not carry any information.

In [6]:
# YOUR CODE HERE
df = df.drop(columns=["Unnamed: 0"])
df

,PID,ST_NUM,ST_NAME,OWN_OCCUPIED,NUM_BEDROOMS,NUM_BATH,AREA
0,100001000.0,4.0,Evropská,True,3.0,1.0,90
1,100002000.0,97.0,Stroměstské náměstí,False,3.0,1.0,112
2,100003000.0,125.0,Opatovská,False,2.5,1.0,80
3,100004000.0,20.0,Opatovská,True,1.0,1.0,71
4,100000125.0,203.0,Evropská,True,3.0,2.0,160
5,100006000.0,27.0,Přemyslovská,True,2.5,1.0,112
6,100007000.0,1.0,Ruská,True,2.0,1.0,95
7,100008000.0,213.0,Italská,True,1.0,1.0,112
8,100009000.0,56.0,Italská,True,2.5,2.0,180


In [7]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   PID           9 non-null      float64 
 1   ST_NUM        9 non-null      float64 
 2   ST_NAME       9 non-null      category
 3   OWN_OCCUPIED  9 non-null      bool    
 4   NUM_BEDROOMS  9 non-null      float64 
 5   NUM_BATH      9 non-null      float64 
 6   AREA          9 non-null      int64   
dtypes: bool(1), category(1), float64(4), int64(1)
memory usage: 726.0 bytes
None


Most data mining algorithms require numerical attributes, so categorical attributes have to be converted into numerical ones. Because categorical values usually have no particular order, one-hot encoding is the appropriate way to convert them. Use the function `pd.get_dummies()` to convert the street names into numerical attributes.

In [8]:
# YOUR CODE HERE
df = pd.get_dummies(df, columns=["ST_NAME"])
df

,PID,ST_NUM,OWN_OCCUPIED,NUM_BEDROOMS,NUM_BATH,AREA,ST_NAME_Evropská,ST_NAME_Italská,ST_NAME_Opatovská,ST_NAME_Přemyslovská,ST_NAME_Ruská,ST_NAME_Stroměstské náměstí
0,100001000.0,4.0,True,3.0,1.0,90,True,False,False,False,False,False
1,100002000.0,97.0,False,3.0,1.0,112,False,False,False,False,False,True
2,100003000.0,125.0,False,2.5,1.0,80,False,False,True,False,False,False
3,100004000.0,20.0,True,1.0,1.0,71,False,False,True,False,False,False
4,100000125.0,203.0,True,3.0,2.0,160,True,False,False,False,False,False
5,100006000.0,27.0,True,2.5,1.0,112,False,False,False,True,False,False
6,100007000.0,1.0,True,2.0,1.0,95,False,False,False,False,True,False
7,100008000.0,213.0,True,1.0,1.0,112,False,True,False,False,False,False
8,100009000.0,56.0,True,2.5,2.0,180,False,True,False,False,False,False


In [9]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 12 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   PID                          9 non-null      float64
 1   ST_NUM                       9 non-null      float64
 2   OWN_OCCUPIED                 9 non-null      bool   
 3   NUM_BEDROOMS                 9 non-null      float64
 4   NUM_BATH                     9 non-null      float64
 5   AREA                         9 non-null      int64  
 6   ST_NAME_Evropská             9 non-null      bool   
 7   ST_NAME_Italská              9 non-null      bool   
 8   ST_NAME_Opatovská            9 non-null      bool   
 9   ST_NAME_Přemyslovská         9 non-null      bool   
 10  ST_NAME_Ruská                9 non-null      bool   
 11  ST_NAME_Stroměstské náměstí  9 non-null      bool   
dtypes: bool(7), float64(4), int64(1)
memory usage: 551.0 bytes
None
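The indicator columns above are Boolean. If 0/1 integers are preferred, `pd.get_dummies()` accepts a `dtype` argument. The following is a minimal sketch on a hypothetical three-row table (`streets` and `encoded` are made-up names), not a required part of the exercise.

```python
import pandas as pd

# Hypothetical three-row table with one categorical column
streets = pd.DataFrame({
    "AREA": [90, 112, 80],
    "ST_NAME": ["Evropská", "Italská", "Evropská"],
})

# dtype=int requests 0/1 integer indicators instead of Booleans
encoded = pd.get_dummies(streets, columns=["ST_NAME"], dtype=int)
print(encoded)

# In a one-hot encoding, exactly one indicator is set in each row
row_sums = encoded.filter(like="ST_NAME_").sum(axis=1).tolist()
print(row_sums)
```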


Save the final table in the CSV file `property_data_clean2.csv`.

In [13]:
# YOUR CODE HERE
# index=False avoids writing the row index as an extra unnamed column
df.to_csv("data/property_data_clean2.csv", index=False)
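A column such as `Unnamed: 0`, which was dropped earlier, typically appears when a table is written together with its index and then read back as ordinary data. The sketch below reproduces this with an in-memory buffer; the table and the variable names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical two-row table
table = pd.DataFrame({"AREA": [90, 112]})

# Default: the index is written as a first column without a name
with_index = io.StringIO()
table.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
table.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_no_index)
```

An alternative is to keep the index in the file and read it back with `pd.read_csv(..., index_col=0)`.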

Now the table can be converted into a numpy array. There is more than one way to do this, e.g.,
* using the DataFrame method `to_numpy()`,
* passing the DataFrame to `np.array`,
* passing the DataFrame to `np.array` together with the element type of the array.

Let us compare them.


In [16]:
# using to_numpy
# YOUR CODE HERE
df_to_numpy = df.to_numpy()
df_to_numpy

array([[100001000.0, 4.0, True, 3.0, 1.0, 90, True, False, False, False,
        False, False],
       [100002000.0, 97.0, False, 3.0, 1.0, 112, False, False, False,
        False, False, True],
       [100003000.0, 125.0, False, 2.5, 1.0, 80, False, False, True,
        False, False, False],
       [100004000.0, 20.0, True, 1.0, 1.0, 71, False, False, True, False,
        False, False],
       [100000125.0, 203.0, True, 3.0, 2.0, 160, True, False, False,
        False, False, False],
       [100006000.0, 27.0, True, 2.5, 1.0, 112, False, False, False,
        True, False, False],
       [100007000.0, 1.0, True, 2.0, 1.0, 95, False, False, False, False,
        True, False],
       [100008000.0, 213.0, True, 1.0, 1.0, 112, False, True, False,
        False, False, False],
       [100009000.0, 56.0, True, 2.5, 2.0, 180, False, True, False,
        False, False, False]], dtype=object)

The columns have different types (float, integer, and Boolean), so `to_numpy()` returns an array with `dtype=object` in which every value keeps its original type.

In [15]:
# using simple constructor from numpy
np.array(df)

array([[100001000.0, 4.0, True, 3.0, 1.0, 90, True, False, False, False,
        False, False],
       [100002000.0, 97.0, False, 3.0, 1.0, 112, False, False, False,
        False, False, True],
       [100003000.0, 125.0, False, 2.5, 1.0, 80, False, False, True,
        False, False, False],
       [100004000.0, 20.0, True, 1.0, 1.0, 71, False, False, True, False,
        False, False],
       [100000125.0, 203.0, True, 3.0, 2.0, 160, True, False, False,
        False, False, False],
       [100006000.0, 27.0, True, 2.5, 1.0, 112, False, False, False,
        True, False, False],
       [100007000.0, 1.0, True, 2.0, 1.0, 95, False, False, False, False,
        True, False],
       [100008000.0, 213.0, True, 1.0, 1.0, 112, False, True, False,
        False, False, False],
       [100009000.0, 56.0, True, 2.5, 2.0, 180, False, True, False,
        False, False, False]], dtype=object)

The result is the same as with `to_numpy()`.

In [17]:
# using the numpy constructor together with specifying the type of the elements of the array
np.array(df,dtype=float)

array([[1.00001000e+08, 4.00000000e+00, 1.00000000e+00, 3.00000000e+00,
        1.00000000e+00, 9.00000000e+01, 1.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [1.00002000e+08, 9.70000000e+01, 0.00000000e+00, 3.00000000e+00,
        1.00000000e+00, 1.12000000e+02, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00],
       [1.00003000e+08, 1.25000000e+02, 0.00000000e+00, 2.50000000e+00,
        1.00000000e+00, 8.00000000e+01, 0.00000000e+00, 0.00000000e+00,
        1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [1.00004000e+08, 2.00000000e+01, 1.00000000e+00, 1.00000000e+00,
        1.00000000e+00, 7.10000000e+01, 0.00000000e+00, 0.00000000e+00,
        1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [1.00000125e+08, 2.03000000e+02, 1.00000000e+00, 3.00000000e+00,
        2.00000000e+00, 1.60000000e+02, 1.00000000e+00, 0.00

A homogeneous numpy array is the representation required by many data mining algorithms.
1. Which of the above methods is suitable for matrix computations in `numpy`?
2. Add some parameter(s) to the `.to_numpy()` call to obtain a homogeneous numpy array.


In [19]:
# YOUR CODE HERE
df_to_numpy = df.to_numpy(dtype = float)
df_to_numpy

array([[1.00001000e+08, 4.00000000e+00, 1.00000000e+00, 3.00000000e+00,
        1.00000000e+00, 9.00000000e+01, 1.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [1.00002000e+08, 9.70000000e+01, 0.00000000e+00, 3.00000000e+00,
        1.00000000e+00, 1.12000000e+02, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00],
       [1.00003000e+08, 1.25000000e+02, 0.00000000e+00, 2.50000000e+00,
        1.00000000e+00, 8.00000000e+01, 0.00000000e+00, 0.00000000e+00,
        1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [1.00004000e+08, 2.00000000e+01, 1.00000000e+00, 1.00000000e+00,
        1.00000000e+00, 7.10000000e+01, 0.00000000e+00, 0.00000000e+00,
        1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [1.00000125e+08, 2.03000000e+02, 1.00000000e+00, 3.00000000e+00,
        2.00000000e+00, 1.60000000e+02, 1.00000000e+00, 0.00
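To make the difference between the two kinds of arrays concrete, the sketch below compares the element types on a small hypothetical table (`mixed` is a made-up name) and then runs two typical numerical operations on the float version.

```python
import numpy as np
import pandas as pd

# Hypothetical table mixing float, Boolean, and integer columns
mixed = pd.DataFrame({
    "NUM_BATH": [1.0, 2.0],
    "OWN_OCCUPIED": [True, False],
    "AREA": [90, 160],
})

as_object = mixed.to_numpy()             # mixed column types -> dtype=object
as_float = mixed.to_numpy(dtype=float)   # every value cast to float64

print(as_object.dtype, as_float.dtype)

# On the float array, column means and matrix products operate
# on native floating-point data
col_means = as_float.mean(axis=0)
gram = as_float.T @ as_float
print(col_means)
print(gram.shape)
```

An object array stores references to Python objects, so the homogeneous float array is generally the safer input for numerical routines.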

`numpy` has a binary format for storing data in files. Save the table using suitable `numpy` functions as
1. `.npy` file `property_data_clean2.npy`,
2. `.npz` file `property_data_clean2.npz`,
3. `.npz` compressed file `property_data_clean2_comp.npz`.

What are the differences between the methods `np.save()`, `np.savez()` and `np.savez_compressed()`?

In [22]:
# YOUR CODE HERE
np.save("data/property_data_clean2.npy",df_to_numpy)
np.savez("data/property_data_clean2.npz",df_to_numpy)
np.savez_compressed("data/property_data_clean2_comp.npz",df_to_numpy)
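As a hedged illustration of how the three functions differ: `np.save()` writes a single array to one `.npy` file, `np.savez()` bundles one or more arrays into an uncompressed `.npz` archive, and `np.savez_compressed()` writes the same kind of archive with compression. The sketch below uses a temporary folder and a hypothetical array of zeros; the file names and the key `table` are made up for the example.

```python
import os
import tempfile
import numpy as np

# Hypothetical array; zeros compress very well
data = np.zeros((200, 50))

with tempfile.TemporaryDirectory() as folder:
    npy_path = os.path.join(folder, "demo.npy")
    npz_path = os.path.join(folder, "demo.npz")
    comp_path = os.path.join(folder, "demo_comp.npz")

    np.save(npy_path, data)                     # one array, one file
    np.savez(npz_path, table=data)              # archive of named arrays
    np.savez_compressed(comp_path, table=data)  # same archive, compressed

    loaded_npy = np.load(npy_path)              # returns the array directly
    with np.load(npz_path) as archive:          # returns a dict-like archive
        keys = list(archive.files)
        loaded_npz = archive["table"]

    sizes = {
        "npy": os.path.getsize(npy_path),
        "npz": os.path.getsize(npz_path),
        "comp": os.path.getsize(comp_path),
    }

print(keys)
print(sizes)
```

Arrays passed to `np.savez()` positionally, as in the cell above, are stored under the default names `arr_0`, `arr_1`, and so on, so the saved table can be read back with `np.load(...)["arr_0"]`.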
