# Quick data inspection
## Create a dataframe


In [22]:
import pandas as pd

df = pd.DataFrame({
    "zip_code": [11777, 85278, 7086, 7086, 11777, 85278, 85361, 7086],
    "price_thousand_usd": [250, 320, 180, 450, 390, 275, 500, 210],
    "size_square_feet": [1800, 2200, 1200, 3000, 2500, 1900, 3500, 1400],
    "num_bedrooms": [3, 4, 2, 5, 4, 3, 6, 2],
    "property_type": ["House", "House", "Condo", "House",
                      "Townhouse", "Condo", "House", "Condo"],
    "condition_rating": ["Fair", "Good", "Good", "Excellent",
                         "Very Good", "Fair", "Excellent", "Good"],
    "stove_type": ["Gas", "Electric", "Electric", "Gas",
                   "Gas", "Electric", "Gas", "Electric"]
})

df

Unnamed: 0,zip_code,price_thousand_usd,size_square_feet,num_bedrooms,property_type,condition_rating,stove_type
0,11777,250,1800,3,House,Fair,Gas
1,85278,320,2200,4,House,Good,Electric
2,7086,180,1200,2,Condo,Good,Electric
3,7086,450,3000,5,House,Excellent,Gas
4,11777,390,2500,4,Townhouse,Very Good,Gas
5,85278,275,1900,3,Condo,Fair,Electric
6,85361,500,3500,6,House,Excellent,Gas
7,7086,210,1400,2,Condo,Good,Electric


## Inspect type

Reading the table gives a preliminary understanding of each column:
- `zip_code`: looks numerical, but should probably be treated as categorical
- `price_thousand_usd`: numerical (continuous, usually stored as float), price in thousands of US dollars
- `size_square_feet`: numerical (continuous, usually stored as float), size in square feet
- `num_bedrooms`: numerical (discrete, usually stored as integer), the number of bedrooms
- `property_type`: categorical (nominal)
- `condition_rating`: categorical (ordinal), from `Fair` to `Excellent`
- `stove_type`: categorical (nominal)

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   zip_code            8 non-null      int64 
 1   price_thousand_usd  8 non-null      int64 
 2   size_square_feet    8 non-null      int64 
 3   num_bedrooms        8 non-null      int64 
 4   property_type       8 non-null      object
 5   condition_rating    8 non-null      object
 6   stove_type          8 non-null      object
dtypes: int64(4), object(3)
memory usage: 580.0+ bytes


## Inspect numerical variables

Units matter:
- The unit of `price_thousand_usd` is thousand dollars, not dollars.
- The unit of `size_square_feet` is square feet, not square meters.



In [42]:
df.describe()

Unnamed: 0,price_thousand_usd,size_square_feet,num_bedrooms
count,8.0,8.0,8.0
mean,321.875,2187.5,3.625
std,115.137728,784.560842,1.407886
min,180.0,1200.0,2.0
25%,240.0,1700.0,2.75
50%,297.5,2050.0,3.5
75%,405.0,2625.0,4.25
max,500.0,3500.0,6.0


If you need to do calculations in dollars, create a new column that holds the price in dollars:

In [43]:
df['price_usd'] = df['price_thousand_usd'] * 1000

df

Unnamed: 0,zip_code,price_thousand_usd,size_square_feet,num_bedrooms,property_type,condition_rating,stove_type,price_usd
0,11777,250,1800,3,House,Fair,Gas,250000
1,85278,320,2200,4,House,Good,Electric,320000
2,7086,180,1200,2,Condo,Good,Electric,180000
3,7086,450,3000,5,House,Excellent,Gas,450000
4,11777,390,2500,4,Townhouse,Very Good,Gas,390000
5,85278,275,1900,3,Condo,Fair,Electric,275000
6,85361,500,3500,6,House,Excellent,Gas,500000
7,7086,210,1400,2,Condo,Good,Electric,210000


Notice:
- `zip_code` looks numerical but behaves like a categorical variable.
- It is stored as integers, which can cause the following problems:
    - it can accidentally be converted to float during a calculation, which makes no sense for a zip code
    - zip codes can start with 0, but integers cannot keep a leading zero (07086 is stored as 7086)
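
Both problems can be reproduced on a tiny example. The sketch below uses a hypothetical three-value Series rather than the notebook's `df`, and `reindex` is only one of several operations that can introduce a missing value.

```python
import pandas as pd

# Hypothetical zip codes stored as integers; 07086 has already lost its zero.
zips = pd.Series([11777, 7086, 85278], name="zip_code")
print(zips.dtype)  # int64

# Introducing a missing value (here via reindex; a merge can do the same)
# forces the integer column to float, because NaN is a float.
with_gap = zips.reindex([0, 1, 2, 3])
print(with_gap.dtype)  # float64
print(with_gap.tolist())

# Arithmetic also runs without complaint, although an "average zip code"
# has no meaning.
print(zips.mean())
```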

A common solution is to:
- convert `zip_code` into a string, and
- add leading zeros to fill to five digits

This situation is common with identifiers such as GeoIDs for places, patient IDs, zip codes, and FIPS (Federal Information Processing Standard) codes. It is therefore important to understand what the data represents and to process it accordingly.


In [25]:
df['zip_code'] = df['zip_code'].astype(str).str.zfill(5)


In [26]:
df['zip_code']

0    11777
1    85278
2    07086
3    07086
4    11777
5    85278
6    85361
7    07086
Name: zip_code, dtype: object
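
When the data comes from a file instead of a hand-built dictionary, the leading zeros can often be kept from the start by reading the ID column as text. A minimal sketch, assuming a hypothetical two-row CSV (held in a string here so the example runs on its own):

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a real file on disk.
csv_text = "zip_code,price_thousand_usd\n07086,180\n11777,250\n"

# Default parsing infers integers and drops the leading zero.
as_int = pd.read_csv(io.StringIO(csv_text))

# Declaring the dtype of the ID column keeps it as text.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"zip_code": str})

print(as_int["zip_code"].tolist())
print(as_str["zip_code"].tolist())
```

With this approach `zfill` is only needed when the source file itself has already lost the zeros.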

## Inspect categorical variables

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   zip_code            8 non-null      object
 1   price_thousand_usd  8 non-null      int64 
 2   size_square_feet    8 non-null      int64 
 3   num_bedrooms        8 non-null      int64 
 4   property_type       8 non-null      object
 5   condition_rating    8 non-null      object
 6   stove_type          8 non-null      object
dtypes: int64(3), object(4)
memory usage: 580.0+ bytes


In [41]:
categorical_cols = df.select_dtypes(exclude=['number']).columns

for col in categorical_cols:
    print(df[col].value_counts())
    print('\n')

zip_code
07086    3
11777    2
85278    2
85361    1
Name: count, dtype: int64


property_type
House        4
Condo        3
Townhouse    1
Name: count, dtype: int64


condition_rating
Good         3
Fair         2
Excellent    2
Very Good    1
Name: count, dtype: int64


stove_type
Gas         4
Electric    4
Name: count, dtype: int64




### Convert nominal data to `category` dtype


In [45]:
for nominal_data in ['zip_code', 'property_type', 'stove_type']:
    df[nominal_data] = df[nominal_data].astype('category')

In [46]:
df.dtypes

zip_code              category
price_thousand_usd       int64
size_square_feet         int64
num_bedrooms             int64
property_type         category
condition_rating        object
stove_type            category
price_usd                int64
dtype: object

### Convert ordinal data to ordered category dtype

Note: a column that holds categorical data does not automatically have the `category` dtype. It must be cast explicitly, for example with `.astype('category')`.

When casting ordinal data to the category dtype, cast it to an ordered category dtype so that the order is preserved. There are two scenarios:
- If the column is not yet category dtype, use `.astype()` with `pd.CategoricalDtype(categories=order_you_want_to_maintain, ordered=True)`.
- If the column is already category dtype, use one of the following two methods:
    - `.cat.set_categories(new_categories, ordered=True)`
    - `.cat.reorder_categories(new_categories, ordered=True)`

In [47]:
# first inspect the unique values in condition_rating
df['condition_rating'].unique()

array(['Fair', 'Good', 'Excellent', 'Very Good'], dtype=object)

In [None]:
condition_rating_order = ['Fair', 'Good', 'Very Good', 'Excellent']

df['condition_rating'] = df['condition_rating'].astype(pd.CategoricalDtype(categories=condition_rating_order, ordered=True))

df['condition_rating'].value_counts(sort=False) # set sort=False to maintain the order we set for condition_rating, instead of the order of count
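
The cell above covers a column that is not yet category dtype. For a column that already has an unordered category dtype, a minimal sketch of the two `.cat` methods could look like the following. The `ratings` Series is a hypothetical stand-in, not the notebook's `df`.

```python
import pandas as pd

# Hypothetical ratings already stored as an unordered category dtype.
ratings = pd.Series(["Fair", "Good", "Excellent", "Very Good"], dtype="category")
print(ratings.cat.ordered)  # False; categories default to alphabetical order

order = ["Fair", "Good", "Very Good", "Excellent"]

# reorder_categories expects `order` to contain exactly the existing categories.
reordered = ratings.cat.reorder_categories(order, ordered=True)

# set_categories also allows adding or removing categories; values whose
# category is removed become NaN.
reset = ratings.cat.set_categories(order, ordered=True)

print(reordered.cat.categories.tolist())
print(reordered.cat.ordered)
```

Because `order` matches the existing categories here, both methods give the same result; `reorder_categories` is the stricter choice when no category should be added or dropped by accident.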

#### Can we convert ordinal data into integers?

You can, although it is often not considered best practice.

In practice, however, many datasets already record ordinal data as integers. The most important thing is to understand the data and which operations are appropriate for it. If you convert `condition_rating` to integers, remember that the integers only encode rank: you can compare and sort them, but arithmetic on them is not meaningful.
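
The difference between ranking and arithmetic can be seen directly on an ordered category. The sketch below rebuilds a small hypothetical `ratings` Series so that it runs on its own.

```python
import pandas as pd

order = ["Fair", "Good", "Very Good", "Excellent"]
ratings = pd.Series(
    ["Fair", "Good", "Excellent", "Very Good"],
    dtype=pd.CategoricalDtype(categories=order, ordered=True),
)

# Ranking operations are well defined on an ordered category.
print(ratings.min(), ratings.max())
print((ratings >= "Good").tolist())
print(ratings.sort_values().tolist())

# Arithmetic is not: pandas refuses to average the labels.
try:
    ratings.mean()
except TypeError as err:
    print("mean() rejected:", err)

# The integer codes do produce a mean, but it silently assumes that the
# gaps between adjacent ratings are all equal.
print(ratings.cat.codes.mean())
```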

In [80]:
# if we already have an ordered category dtype, we can simply use .cat.codes
df['condition_rating_numeric'] = df['condition_rating'].cat.codes

df[['condition_rating', 'condition_rating_numeric']]

Unnamed: 0,condition_rating,condition_rating_numeric
0,Fair,0
1,Good,1
2,Good,1
3,Excellent,3
4,Very Good,2
5,Fair,0
6,Excellent,3
7,Good,1


In [81]:
# in this case, the new column has an integer dtype (int8)
df['condition_rating_numeric'].dtypes

dtype('int8')

In [82]:
# you can also use .cat.rename_categories

df['condition_rating_numeric'] = df['condition_rating'].cat.rename_categories([0, 1, 2, 3])

df[['condition_rating','condition_rating_numeric']]

Unnamed: 0,condition_rating,condition_rating_numeric
0,Fair,0
1,Good,1
2,Good,1
3,Excellent,3
4,Very Good,2
5,Fair,0
6,Excellent,3
7,Good,1


In [79]:
# in this case, the type of this column is still category dtype
df['condition_rating_numeric'].dtypes

CategoricalDtype(categories=[0, 1, 2, 3], ordered=True, categories_dtype=int64)
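
If a downstream tool needs plain integers rather than a category dtype, the relabelled column can be cast once more. A minimal sketch, assuming a hypothetical 1-based scoring (the notebook uses 0-based labels) and a small stand-in Series:

```python
import pandas as pd

order = ["Fair", "Good", "Very Good", "Excellent"]
ratings = pd.Series(
    ["Fair", "Good", "Excellent", "Very Good"],
    dtype=pd.CategoricalDtype(categories=order, ordered=True),
)

# Relabel with 1-based scores (a hypothetical choice); dtype stays category.
scores_cat = ratings.cat.rename_categories([1, 2, 3, 4])

# Cast to a plain integer column.
scores_int = scores_cat.astype("int64")

print(scores_cat.dtype)
print(scores_int.dtype, scores_int.tolist())
```

Keeping the category version preserves the information that the values are ordered labels; the integer version drops it.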

### Cross tabulation

In [56]:
pd.crosstab(df['condition_rating'], df['stove_type'])

stove_type,Electric,Gas
condition_rating,,
Fair,1,1
Good,3,0
Very Good,0,1
Excellent,0,2
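
`pd.crosstab` can also report totals and proportions through its `margins` and `normalize` parameters. The sketch below rebuilds the two columns as plain strings so that it runs on its own, which means the rows appear in alphabetical order, not in the ordered-category order shown above.

```python
import pandas as pd

# The same two columns as in the notebook, rebuilt as plain strings.
df = pd.DataFrame({
    "condition_rating": ["Fair", "Good", "Good", "Excellent",
                         "Very Good", "Fair", "Excellent", "Good"],
    "stove_type": ["Gas", "Electric", "Electric", "Gas",
                   "Gas", "Electric", "Gas", "Electric"],
})

# margins=True appends row and column totals labelled "All".
counts = pd.crosstab(df["condition_rating"], df["stove_type"], margins=True)

# normalize="index" turns each row into proportions that sum to 1.
shares = pd.crosstab(df["condition_rating"], df["stove_type"], normalize="index")

print(counts)
print(shares)
```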


### Inspect by group
If we want to inspect data within each group of a categorical variable, we can use `.groupby()`.


In [60]:
df.groupby(by="property_type", observed=True)['price_usd'].mean()

property_type
Condo        221666.666667
House        380000.000000
Townhouse    390000.000000
Name: price_usd, dtype: float64

In [62]:
df.groupby(by='zip_code', observed=True)['size_square_feet'].describe()

zip_code,count,mean,std,min,25%,50%,75%,max
07086,3.0,1866.666667,986.576572,1200.0,1300.0,1400.0,2200.0,3000.0
11777,2.0,2150.0,494.974747,1800.0,1975.0,2150.0,2325.0,2500.0
85278,2.0,2050.0,212.132034,1900.0,1975.0,2050.0,2125.0,2200.0
85361,1.0,3500.0,,3500.0,3500.0,3500.0,3500.0,3500.0


In [63]:
df.groupby(by='property_type', observed=True).agg(
    {'price_usd':['mean','sum'],
     'size_square_feet': ['mean','sum']}
)

,price_usd,price_usd,size_square_feet,size_square_feet
property_type,mean,sum,mean,sum
Condo,221666.666667,665000,1500.0,4500
House,380000.0,1520000,2625.0,10500
Townhouse,390000.0,390000,2500.0,2500


You can also pass custom functions to `.agg()` after `.groupby()`, for example through `pd.NamedAgg`:

In [None]:
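# 6.91 is a hard-coded USD-to-CNY exchange rate; 10.76 is the approximate number of square feet in a square meter
df.groupby(by='property_type', observed=True).agg(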
    mean_price_in_thousand_CNY = pd.NamedAgg(column='price_thousand_usd', aggfunc=lambda s: s.mean()*6.91),
    mean_size_in_square_meters = pd.NamedAgg(column='size_square_feet', aggfunc=lambda s: s.mean()/10.76),
)
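
The same aggregation can be written with named functions and the `(column, aggfunc)` tuple shorthand, which keeps the conversion factors in one visible place. This is a sketch on a hypothetical four-row subset of the data. The helper names and constants are introduced here for illustration, and the exchange rate is the notebook's hard-coded value, not a current rate.

```python
import pandas as pd

# Hypothetical four-row subset so the sketch runs on its own.
df = pd.DataFrame({
    "property_type": ["House", "House", "Condo", "Condo"],
    "price_thousand_usd": [250, 320, 180, 210],
    "size_square_feet": [1800, 2200, 1200, 1400],
})

USD_TO_CNY = 6.91        # assumed fixed rate, copied from the cell above
SQ_FT_PER_SQ_M = 10.76   # approximate conversion factor


def mean_in_thousand_cny(prices):
    """Mean price converted from thousand USD to thousand CNY."""
    return prices.mean() * USD_TO_CNY


def mean_in_square_meters(sizes):
    """Mean size converted from square feet to square meters."""
    return sizes.mean() / SQ_FT_PER_SQ_M


# Each keyword names an output column; the tuple is (input column, function).
summary = df.groupby("property_type").agg(
    mean_price_in_thousand_cny=("price_thousand_usd", mean_in_thousand_cny),
    mean_size_in_square_meters=("size_square_feet", mean_in_square_meters),
)
print(summary)
```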