<a href="https://colab.research.google.com/github/haiderali2017/my_exploratory_data_analyses/blob/main/Data_Indicator_3_RPPI_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [None]:
df = pd.read_csv('https://data.smartdublin.ie/dataset/4997223b-13b2-4c97-9e88-cd94c6d35aec/resource/1fa1e318-98af-4db4-a71c-7d87fb77f9b0/download/indicator-3-rppi-indicator-v3.csv')


# Data Exploration


In [None]:
df.head()

Unnamed: 0,Month Year,Dublin - PPI - All Residential (2015 = 100)\nDublin,National excluding Dublin - all residential properties\nNational Excl. Dublin,% YoY\nDublin,Column1,% YoY National \nexcl. Dublin,MoM \nDublin,MoM\nNational excl. Dublin,Column2,%MoM\nDublin,...,Unnamed: 16373,Unnamed: 16374,Unnamed: 16375,Unnamed: 16376,Unnamed: 16377,Unnamed: 16378,Unnamed: 16379,Unnamed: 16380,Unnamed: 16381,Unnamed: 16382
0,Jan 05,118.5,142.1,,,,,,,,...,,,,,,,,,,
1,Feb 05,121.0,142.2,,,,2.5,0.1,,2.1%,...,,,,,,,,,,
2,Mar 05,121.9,142.9,,,,0.9,0.7,,0.7%,...,,,,,,,,,,
3,Apr 05,122.7,143.8,,,,0.8,0.9,,0.7%,...,,,,,,,,,,
4,May 05,123.7,144.8,,,,1.0,1.0,,0.8%,...,,,,,,,,,,


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 239 entries, 0 to 238
Columns: 16383 entries, Month Year to Unnamed: 16382
dtypes: float64(16377), object(6)
memory usage: 29.9+ MB


# Data Cleaning


1. Removing unnecessary columns
2. Renaming columns
3. Removing missing values
4. Removing '%' from column values



### 1. Removing unnecessary columns

In this step, we drop the columns named `Unnamed: N`, which appear to hold no data; there are more than 16,000 of them. We also drop the two placeholder columns `Column1` and `Column2`.

In [None]:
df = df.drop(columns=[f'Unnamed: {i}' for i in range(1, 16383)], errors='ignore') # dropping all columns that are unnamed
df = df.drop(columns=['Column1', 'Column2'], errors='ignore')
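The same clean-up can be done without hard-coding the column range. The snippet below is a minimal sketch on a small stand-in frame (the `demo` frame and its values are made up for illustration); it selects columns by name pattern instead of by number.

```python
import numpy as np
import pandas as pd

# Stand-in frame (hypothetical values) with the same kinds of junk columns as the real file
demo = pd.DataFrame({
    "Month Year": ["Jan 05", "Feb 05"],
    "Column1": [np.nan, np.nan],
    "Unnamed: 11": [np.nan, np.nan],
    "Unnamed: 12": [np.nan, np.nan],
})

# Keep only the columns whose name does not start with "Unnamed"
keep = ~demo.columns.str.startswith("Unnamed")

# Then drop the placeholder columns; errors="ignore" skips names that are absent
demo = demo.loc[:, keep].drop(columns=["Column1", "Column2"], errors="ignore")

print(demo.columns.tolist())
```

This keeps working if the number of empty columns in the source file changes.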

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 239 entries, 0 to 238
Data columns (total 9 columns):
 #   Column                                                                        Non-Null Count  Dtype  
---  ------                                                                        --------------  -----  
 0   Month Year                                                                    237 non-null    object 
 1   Dublin - PPI - All Residential (2015 = 100)
Dublin                            237 non-null    float64
 2   National excluding Dublin - all residential properties
National Excl. Dublin  237 non-null    float64
 3   % YoY
Dublin                                                                  225 non-null    object 
 4   % YoY National 
excl. Dublin                                                  225 non-null    object 
 5   MoM 
Dublin                                                                   236 non-null    float64
 6   MoM
National excl. Dublin          

In essence, the column headers above describe property price trends and changes in Ireland, for Dublin and for the rest of the country. They cover both year-over-year and month-over-month comparisons, giving a comprehensive view of the market dynamics.
Here's a breakdown of each column:

1. **Month Year:** This column combines the month and year of the data point, i.e. the time period the data refers to. Values are abbreviated, such as "Jan 05" or "Feb 05".

2. **Dublin - PPI - All Residential (2015 = 100) Dublin:** This column seems to represent the Residential Property Price Index (RPPI) for Dublin. The index is based on the year 2015, where the index value is 100. A value of 120 would mean that residential property prices in Dublin are 20% higher than their 2015 level.

3. **National excluding Dublin - all residential properties National Excl. Dublin:** This column represents the RPPI for the entire country of Ireland, excluding Dublin. Again, it's likely based on the year 2015 with an index value of 100.

4. **% YoY Dublin:** This column likely represents the year-over-year percentage change in the RPPI for Dublin. It shows the percentage change in property prices in Dublin compared to the same period in the previous year.

5. **% YoY National excl. Dublin:** Similar to the previous column, this column likely represents the year-over-year percentage change in the RPPI for the entire country of Ireland, excluding Dublin.

6. **MoM Dublin:** This column likely represents the month-over-month change in the RPPI for Dublin. It shows the absolute difference in index points compared to the previous month (for example, 121.0 - 118.5 = 2.5 for Feb 05).

7. **MoM National excl. Dublin:** Similar to the previous column, this column likely represents the month-over-month change in the RPPI for the entire country of Ireland, excluding Dublin.

8. **%MoM Dublin:** This column likely represents the month-over-month percentage change in the RPPI for Dublin. It shows the percentage change in property prices in Dublin compared to the previous month.

9. **%MoM National excl. Dublin:** Similar to the previous column, this column likely represents the month-over-month percentage change in the RPPI for the entire country of Ireland, excluding Dublin.
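The reading of the MoM and %MoM columns given above can be checked against the first rows printed by `df.head()`. The sketch below copies the Dublin index values for Jan 05 to May 05 into a small frame (the `sample` name is only for illustration) and recomputes both columns.

```python
import pandas as pd

# Dublin index values for Jan 05 to May 05, copied from the df.head() output
sample = pd.DataFrame({
    "Month Year": ["Jan 05", "Feb 05", "Mar 05", "Apr 05", "May 05"],
    "Dublin": [118.5, 121.0, 121.9, 122.7, 123.7],
})

# MoM: absolute difference in index points from the previous month
sample["MoM"] = sample["Dublin"].diff().round(1)

# %MoM: the same difference as a percentage of the previous month's index value
sample["%MoM"] = (sample["Dublin"].pct_change() * 100).round(1)

print(sample)
```

The recomputed values (2.5, 0.9, 0.8, 1.0 for MoM and 2.1, 0.7, 0.7, 0.8 for %MoM) agree with the figures shown in the file, which supports the interpretation.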

### 2. Renaming columns

In this step, we rename the columns so that they are concise and meaningful. Several of the original headers contain embedded line breaks (`\n`), which the new names remove.

In [None]:
df = df.rename(columns={"Dublin - PPI - All Residential (2015 = 100)\nDublin": "Dublin - PPI (2015 = 100)",
                        "National excluding Dublin - all residential properties\nNational Excl. Dublin": "RPPI National excluding Dublin",
                        "% YoY\nDublin": "% YoY Dublin",
                        "% YoY National \nexcl. Dublin": "% YoY National excl. Dublin",
                        "MoM\nNational excl. Dublin": "MoM National excl. Dublin",
                        "MoM \nDublin": "MoM Dublin",
                        "%MoM\nDublin": "%MoM Dublin",
                        "%MoM\nNational excl. Dublin": "%MoM National excl. Dublin",
                        })
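As an aside, the line breaks in the headers can also be collapsed programmatically before any renaming. This is only a sketch: it runs on four of the raw headers copied from above, and it normalises whitespace but does not shorten the long index column names.

```python
import pandas as pd

# A few of the raw headers, copied from the rename mapping above
raw_headers = pd.Index([
    "% YoY\nDublin",
    "% YoY National \nexcl. Dublin",
    "MoM \nDublin",
    "%MoM\nDublin",
])

# Replace every run of whitespace (including "\n") with one space, then trim the ends
clean_headers = raw_headers.str.replace(r"\s+", " ", regex=True).str.strip()

print(clean_headers.tolist())
```

On the real frame the same expression could be assigned back with `df.columns = ...`, after which only the two long index columns would still need an explicit rename.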

### 3. Removing missing values

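In this step, we remove every row that contains at least one missing value. The two % YoY columns have the most gaps (14 each, as the count below shows), so 14 rows are dropped, which is why the cleaned data starts at Jan 06 instead of Jan 05.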

In [None]:
# Check the number of missing values in each column
missing_values = df.isna().sum()

print(missing_values)

Month Year                         2
Dublin - PPI (2015 = 100)          2
RPPI National excluding Dublin     2
% YoY Dublin                      14
% YoY National excl. Dublin       14
MoM Dublin                         3
MoM National excl. Dublin          3
%MoM Dublin                        3
%MoM National excl. Dublin         3
dtype: int64


In [None]:
df = df.dropna().reset_index(drop=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 225 entries, 0 to 224
Data columns (total 9 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Month Year                      225 non-null    object 
 1   Dublin - PPI (2015 = 100)       225 non-null    float64
 2   RPPI National excluding Dublin  225 non-null    float64
 3   % YoY Dublin                    225 non-null    object 
 4   % YoY National excl. Dublin     225 non-null    object 
 5   MoM Dublin                      225 non-null    float64
 6   MoM National excl. Dublin       225 non-null    float64
 7   %MoM Dublin                     225 non-null    object 
 8   %MoM National excl. Dublin      225 non-null    object 
dtypes: float64(4), object(5)
memory usage: 15.9+ KB
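Calling `dropna()` with no arguments discards a row if any column is empty, so the 2005 rows are lost even though their index values are present. If those rows are wanted for the index charts, one option is to restrict the check to the columns that matter. The sketch below uses a small stand-in frame (`demo`, built from the first two rows of the file plus one empty row).

```python
import numpy as np
import pandas as pd

# Stand-in frame: two 2005 rows without a YoY value, plus one completely empty row
demo = pd.DataFrame({
    "Month Year": ["Jan 05", "Feb 05", None],
    "Dublin - PPI (2015 = 100)": [118.5, 121.0, np.nan],
    "% YoY Dublin": [None, None, None],
})

# Default behaviour: any missing value removes the row
all_cols = demo.dropna()

# Only require the date and the index value to be present
index_only = demo.dropna(
    subset=["Month Year", "Dublin - PPI (2015 = 100)"]
).reset_index(drop=True)

print(len(all_cols), len(index_only))
```

With `subset`, the two 2005 rows survive and only the empty row is removed.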


### 4. Removing '%' from column values

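In this step, we remove the % sign from the percentage columns. With the sign attached, pandas stores these columns as text (object dtype) rather than numbers, which leads to misleading visuals.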

In [None]:
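def replace_percentages(df, col_name):
  # Strip the literal '%' sign from every value in the column (modifies df in place).
  # Note: the values remain strings; this does not convert them to numbers.
  df[col_name] = df[col_name].replace('%', '', regex=True)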

replace_percentages(df, '% YoY Dublin')
replace_percentages(df, '% YoY National excl. Dublin')
replace_percentages(df, '%MoM Dublin')
replace_percentages(df, '%MoM National excl. Dublin')
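Stripping the sign leaves the values as strings, so a chart of these columns would still treat them as text. A minimal sketch of the extra conversion step, on a stand-in column (`demo`, with values copied from the `df.head()` output), could look like this:

```python
import pandas as pd

# Stand-in column with values copied from the df.head() output
demo = pd.DataFrame({"%MoM Dublin": ["2.1%", "0.7%", "0.8%"]})

# Remove the trailing '%' and convert the remaining text to floats
demo["%MoM Dublin"] = pd.to_numeric(demo["%MoM Dublin"].str.rstrip("%"))

print(demo["%MoM Dublin"].dtype)
```

`pd.to_numeric` raises an error on values it cannot parse, which is useful here because it surfaces any unexpected entries instead of hiding them.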

In [None]:
df

Unnamed: 0,Month Year,Dublin - PPI (2015 = 100),RPPI National excluding Dublin,% YoY Dublin,% YoY National excl. Dublin,MoM Dublin,MoM National excl. Dublin,%MoM Dublin,%MoM National excl. Dublin
0,Jan 06,136.8,159.0,15.4,11.9,0.6,0.8,0.4,0.5
1,Feb 06,137.9,158.9,14.0,11.7,1.1,-0.1,0.8,-0.1
2,Mar 06,138.7,159.7,13.8,11.8,0.8,0.8,0.6,0.5
3,Apr 06,141.3,161.8,15.2,12.5,2.6,2.1,1.9,1.3
4,May 06,145.0,164.6,17.2,13.7,3.7,2.8,2.6,1.7
...,...,...,...,...,...,...,...,...,...
220,May 24,157.6,203.7,8.8,8.3,0.7,2.5,0.4,0.4
221,Jun 24,158.9,205.0,9.3,8.2,1.9,2.2,0.8,0.6
222,Jul 24,161.0,208.0,10.3,9.1,2.1,3.0,1.3,1.5
223,Aug 24,162.9,209.4,10.9,9.5,1.9,1.4,1.2,0.7


After performing all the data cleaning steps, our dataset has shrunk from 239 rows and 16383 columns to 225 rows and 9 columns.

# Data Visualisation


1. Property prices in Dublin (Line Chart)
2. Property prices in Dublin (Bar Chart)
3. Property prices inside Dublin vs Nation (Line Chart)
4. Property prices inside Dublin vs Nation (Bar Chart)




### 1. Property prices in Dublin (Line Chart)

In [None]:
# Using plotly.express
import plotly.express as px

fig = px.line(df, x='Month Year', y="Dublin - PPI (2015 = 100)")
fig.show()

In the visualisation, 2015 is the reference period and is assigned an index value of 100, so every other value is read relative to the average residential property prices in Dublin during 2015. In January 2006 the index stood at 136.8, meaning prices were about 37% above the 2015 average, and in July 2024 it stood at 161.0, about 61% above the 2015 average.
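Because the base value is 100, the percentage difference from the 2015 average is simply the index value minus 100. A short worked example with two index values copied from the cleaned table:

```python
# Dublin index values copied from the cleaned table above
index_values = {"Jan 06": 136.8, "Jul 24": 161.0}

# With 2015 = 100, (index - 100) is the percentage above or below the 2015 average
pct_vs_2015 = {month: round(value - 100, 1) for month, value in index_values.items()}

print(pct_vs_2015)
```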

### 2. Property prices in Dublin (Bar Chart)

In [None]:
fig = px.bar(df, x='Month Year', y="Dublin - PPI (2015 = 100)")
fig.show()

### 3. Property prices inside Dublin vs Nation (Line Chart)

In [None]:
fig = px.line(df, x='Month Year', y=["Dublin - PPI (2015 = 100)", "RPPI National excluding Dublin"])
fig.show()

### 4. Property prices inside Dublin vs Nation (Bar Chart)

In [None]:
fig = px.bar(df, x='Month Year', y=["Dublin - PPI (2015 = 100)", "RPPI National excluding Dublin"])
fig.show()

My notes:

* The Month Year column can be split into separate month and year columns, and further visualisations can be built on them.
* I still don't completely understand whether, with 2015 as the base year and 100 as the reference point, columns like %YoY and %MoM really reflect the correct percentages.
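Both notes can be explored with a small sketch. It assumes the "Jan 05" style labels parse with the format `%b %y` under an English locale, and it uses four Dublin index values copied from the outputs above. The `notes` frame, the `pct_change` helper, and the rescaling factor are made up for illustration. The second half rescales the index to a different base and shows that the year-over-year percentage does not change, because a percentage change is a ratio of two index values and the base cancels out.

```python
import pandas as pd

# Dublin index values copied from the outputs above
notes = pd.DataFrame({
    "Month Year": ["Jan 05", "Feb 05", "Jan 06", "Feb 06"],
    "Dublin": [118.5, 121.0, 136.8, 137.9],
})

# Note 1: parse the labels into dates, then split out year and month
notes["Date"] = pd.to_datetime(notes["Month Year"], format="%b %y")
notes["Year"] = notes["Date"].dt.year
notes["Month"] = notes["Date"].dt.month_name()

# Note 2: percentage change between two index values, rounded to one decimal
def pct_change(later, earlier):
    return round((later / earlier - 1) * 100, 1)

# YoY for Jan 06 on the published base (2015 = 100)
yoy_original = pct_change(136.8, 118.5)

# Same calculation after rescaling so that Jan 05 = 100 (a hypothetical other base)
scale = 100 / 118.5
yoy_rebased = pct_change(136.8 * scale, 118.5 * scale)

print(notes[["Month Year", "Year", "Month"]])
print(yoy_original, yoy_rebased)
```

Both results come out at 15.4, matching the % YoY Dublin value listed for Jan 06, so the choice of 2015 as the base year does not distort the percentage columns.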