### House Price

In this project, I analyze the announcements of a Telegram group called "Zamin", which means "land". The group announces land, shops, houses, and apartments for sale in two cities of Afghanistan (Kabul and Mazar-e-Sharif). Each announcement contains almost all the information relevant to the sale, but not in a structured form. Some announcements include pictures of the item, some give enough information, and others are missing it.

I scraped the data from the Telegram group with Python scripts that are available in my GitHub account ([Github Repository](https://github.com/grzaini/zamin)), analyzed the resulting JSON file, and finally explored the data.

In [None]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import requests
import json

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
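# Open the file in a context manager so that it is closed after loading
with open('/kaggle/input/telegram-zamin-data/channel_messages.json', encoding='utf-8') as f:
    loaded_data = json.load(f)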
df = pd.json_normalize(data=loaded_data)
df.head(2)

The next commands explore the data and extract the usable portion.

In [None]:
df.columns.values

In [None]:
# Take an explicit copy so that later modifications do not act on a view of df
projected_df = df[['id', 'date', 'message','views',
       'forwards', 'edit_date','reactions.results','media.photo.date','media.photo.sizes']].copy()
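The dotted column names above come from flattening nested JSON. The following is a minimal sketch of that behavior, assuming a made-up `messages` list that only loosely resembles the real Telegram export:

```python
import pandas as pd

# Hypothetical sample records (not the actual channel data).
messages = [
    {"id": 1, "message": "text", "views": 10,
     "media": {"photo": {"date": "2023-01-05T10:00:00+00:00"}}},
    {"id": 2, "message": "", "views": 3},  # no media at all
]

# Nested keys are joined with '.', e.g. media -> photo -> date
# becomes the column 'media.photo.date'.
flat = pd.json_normalize(data=messages)
columns = list(flat.columns)

# Records without the nested key get a missing value in that column.
subset = flat[["id", "message", "media.photo.date"]].copy()
```

Records that lack a nested key still appear in the result, with a missing value in the corresponding column, which is why several of the selected columns are sparse.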

The data shows that most of the usable information is in the "message" column, so the next steps focus on it and analyze it in more detail.

- First, empty values in the "message" column are converted to null, and the corresponding rows are dropped.
- Then the columns that contain date and time values are formatted in a readable way.

In [None]:
pd.set_option('mode.chained_assignment', None)
# Treat empty messages as missing values, then drop those rows.
# Assigning the result back is more reliable than inplace=True on a .loc slice.
projected_df['message'] = projected_df['message'].replace('', np.nan)
projected_df = projected_df.dropna(subset=['message'])

# Format the date and time columns in the df as YYYY-MM-DD.
for column in ['date', 'edit_date', 'media.photo.date']:
    projected_df[column] = pd.to_datetime(projected_df[column]).dt.strftime('%Y-%m-%d')

projected_df.info()
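The two cleaning steps can be checked on a small example. This is only a sketch on a made-up `toy` DataFrame, not the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the second message is empty and should be dropped.
toy = pd.DataFrame({
    "message": ["house for sale", "", "shop for rent"],
    "date": ["2023-01-05T10:00:00+00:00",
             "2023-01-06T11:30:00+00:00",
             "2023-02-01T08:15:00+00:00"],
})

# Step 1: empty string -> NaN, then drop rows without a message.
toy["message"] = toy["message"].replace("", np.nan)
toy = toy.dropna(subset=["message"]).copy()

# Step 2: parse the timestamps and keep only the calendar date as text.
toy["date"] = pd.to_datetime(toy["date"]).dt.strftime("%Y-%m-%d")
```

Note that `strftime` turns the column back into strings, which is convenient for display but means that any later date arithmetic would need another `pd.to_datetime` call.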

In [None]:
projected_df.head()

As shown above, the "message" column contains all the information needed for this purpose, but the data is dirty: it contains emoji and other information spread over several lines. The next step removes all emoji and merges the lines of each message into a single line.

In [None]:
import re

# Function to merge lines and remove emoji signs
def merge_lines_and_remove_emojis(text):
    # Merge lines into a single line
    merged_line = ' '.join(text.split('\n'))
    
    # Remove emoji signs using a regular expression
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # Emoticons
                               u"\U0001F300-\U0001F5FF"  # Symbols & Pictographs
                               u"\U0001F680-\U0001F6FF"  # Transport & Map Symbols
                               u"\U0001F700-\U0001F77F"  # Alchemical Symbols
                               u"\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
                               u"\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
                               u"\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
                               u"\U0001FA00-\U0001FA6F"  # Chess Symbols
                               u"\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
                               u"\U00002702-\U000027B0"  # Dingbats
                               "]+", flags=re.UNICODE)
    
    cleaned_text = emoji_pattern.sub('', merged_line)
    return cleaned_text

# Apply the merge and remove function to the DataFrame
projected_df['cleaned_message'] = projected_df['message'].apply(merge_lines_and_remove_emojis)
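Removing an emoji that sits between two words leaves a double space behind. As a possible refinement, the sketch below also collapses repeated whitespace. It uses a reduced pattern with only two of the ranges listed above, and the names `EMOJI` and `clean` are illustrative:

```python
import re

# Reduced pattern for illustration: pictographs and emoticons only.
EMOJI = re.compile("[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F]+")

def clean(text):
    # Remove emoji first, then split on any whitespace (including
    # newlines) and re-join with single spaces.
    without_emoji = EMOJI.sub('', text)
    return ' '.join(without_emoji.split())

sample = "House for sale \U0001F3E0\nKabul"
cleaned = clean(sample)
```

Here `str.split()` without arguments handles both the line merging and the leftover double spaces in one pass.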


All the information about the listed items is now in the cleaned "message" column: for example, whether a house, yard, shop, or apartment is for sale or for rent, in which area of the cities it is located, and in which unit it is measured. The next step searches the column for these keywords, creates an appropriate column for each group of keywords, and adds these columns to the DataFrame.

In [None]:
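# Keyword lists to search for in each message.
# Category: house, land, apartment, block, shop (including spelling variants)
words = ['حویلی','زمین','اپارتمان','بلاک','زمين','حولي','حويلي','بلاك','دکان']
# Status: sale, rent, mortgage
status = ['فروش','کرایه','گروی','گرو']
# Unit of measurement: meter, biswa, jerib
measurement = ['متر','بیسوه','جریب']
# City: Mazar-e-Sharif, Kabul
area = ['مزارشریف:','مزارشريف:','کابل','مزارشريف']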

def extract_substrings(text, substrings):
    found_substrings = [substring for substring in substrings if substring in text]
    return ', '.join(found_substrings)

# Apply the extraction function to the DataFrame
projected_df['category'] = projected_df['cleaned_message'].apply(extract_substrings, args=(words,))
projected_df['status'] = projected_df['cleaned_message'].apply(extract_substrings, args=(status,))
projected_df['measurement'] = projected_df['cleaned_message'].apply(extract_substrings, args=(measurement,))
projected_df['city'] = projected_df['cleaned_message'].apply(extract_substrings, args=(area,))
projected_df.head(2)
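The keyword lists contain several spellings of the same word because messages mix Arabic and Persian forms of the letters yeh and kaf. One possible alternative, sketched here under the assumption that these two letters are the only source of variation, is to normalize the text before matching. `ARABIC_TO_PERSIAN`, `normalize`, and `extract_normalized` are illustrative names, not part of the notebook:

```python
# Map Arabic yeh (U+064A) and kaf (U+0643) to Persian yeh (U+06CC) and kaf (U+06A9).
ARABIC_TO_PERSIAN = str.maketrans({'\u064A': '\u06CC', '\u0643': '\u06A9'})

def normalize(text):
    return text.translate(ARABIC_TO_PERSIAN)

def extract_normalized(text, substrings):
    # Normalize both sides so one spelling per keyword is enough.
    text = normalize(text)
    found = [s for s in map(normalize, substrings) if s and s in text]
    return ', '.join(found)

# 'zamin' (land) written with the Arabic yeh, as it may appear in a message.
sample = '\u0632\u0645\u064A\u0646'
# The same word with the Persian yeh, as the only keyword.
category = extract_normalized(sample, ['\u0632\u0645\u06CC\u0646'])
```

With normalization in place, each keyword list would need only one spelling per word, although other variations (such as missing letters) would still have to be listed explicitly.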

This step determines the actual location of each item, that is, the area of Kabul or Mazar-e-Sharif in which it is located. The location is stored in another column named "area", which is added to the DataFrame.

It uses a function that extracts the three words following a searched keyword, because the location is almost always given right after the name of a city or of a well-known area. These names are listed in an array (area).

In [None]:
area = ['بلخ :','شهرک خالد:','شهرک خالد','کابل:','مزارشريف:','کابل','مزارشريف']

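# Function to extract the three words that follow the first matching keyword.
# Keywords are compared token by token, so multi-word keywords also match.
def extract_next_three_words(text, keywords):
    words = text.split()
    for keyword in keywords:
        key_tokens = keyword.split()
        n = len(key_tokens)
        for i in range(len(words) - n + 1):
            if words[i:i + n] == key_tokens:
                return " ".join(words[i + n:i + n + 3])
    # No keyword found in this message
    return None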

projected_df['area'] = projected_df['cleaned_message'].apply(extract_next_three_words, args=(area,))
projected_df.head()
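The same idea can also be expressed with a regular expression. This is a sketch of an alternative, not the approach used above, and `next_words_after` is a hypothetical helper. It tolerates an optional colon after the keyword, so separate entries with and without a colon would not be needed:

```python
import re

def next_words_after(text, keywords, n=3):
    for keyword in keywords:
        # Keyword, optional colon, then capture up to n whitespace-separated words.
        pattern = re.escape(keyword) + r'\s*:?\s*((?:\S+\s*){1,%d})' % n
        match = re.search(pattern, text)
        if match:
            return match.group(1).strip()
    return None

# English stand-ins for the Persian text, for readability.
single = next_words_after('Kabul: Karte Se street 4', ['Kabul'])
multi = next_words_after('sale Khalid Town: block two road one', ['Khalid Town'])
```

One caveat is that this matches the keyword as a substring, so a keyword that happens to appear inside a longer word would also match.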

Now, we extract the price of each item for sale or rent from the cleaned "message" column.

In [None]:
# Example (in English: "This house was sold for 400000. Last year it was put up for rent for 300000 Afghani."):
#text = " این حویلی به قیمت 400000 فروخته شد. و پارسال به قیمت 300000افغانی به کرایه گذاشته شده بود"
number_pattern = r'\d+'

# Price markers in order of priority, each with the number of characters
# to look back from the marker. 'لک' is lakh (100,000) and 'افغانی' is Afghani.
markers = [('$', 10), ('لک', 7), ('افغانی', 10)]

def extract_cost(t):
    for marker, window in markers:
        end = t.find(marker)
        if end != -1:
            # Start at the first space inside the look-back window. Clamp at 0 so
            # that a marker near the start of the text does not give a negative index.
            start = t.find(' ', max(0, end - window))
            return ''.join(re.findall(number_pattern, t[start:end])).strip()
    # No price marker found
    return 0

def extract_currency(t):
    # Return the first marker that occurs in the text, or "" if none does.
    for marker, _ in markers:
        if marker in t:
            return marker
    return ""
    
#print(extract_cost(text))

projected_df['amount'] = projected_df['cleaned_message'].astype(str).str[20:].apply(extract_cost)
projected_df['currency'] = projected_df['cleaned_message'].astype(str).str[20:].apply(extract_currency)
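As an alternative to the look-back window, the number and its marker can be captured together with one regular expression. This is a minimal sketch, assuming the price is written as plain digits directly before the marker. `PRICE` and `extract_price` are illustrative names:

```python
import re

# Digits, optional whitespace, then one of the three markers.
PRICE = re.compile(r'(\d+)\s*(\$|لک|افغانی)')

def extract_price(text):
    match = PRICE.search(text)
    if match is None:
        return None, ''
    # int() also accepts the Persian digits that \d matches.
    return int(match.group(1)), match.group(2)

amount, currency = extract_price('price 85000 $ negotiable')
missing = extract_price('no price given')
```

This variant skips numbers that are not followed by a marker, but it would not handle prices written with thousands separators or in words.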

In [None]:
projected_df.head()

In [None]:
projected_df.to_csv(r'projected_df.csv', index=False)

In [None]:
cleaned_df = pd.read_csv('/kaggle/input/cleaned-df/cleaned_df.csv')
cleaned_df.head(2)

In [None]:
houses = cleaned_df[(cleaned_df['category'].astype(str).str.contains('حو')) 
           & ~((cleaned_df['status'].astype(str).str.contains('گر')) 
               | (cleaned_df['status'].isnull()) 
               | (cleaned_df['measurement'].astype(str).str.contains('متر')))]

houses.head()

In [None]:
mazar_houses = houses[houses['city'].astype(str).str.contains('مز')].sort_values(by='amount')
mazar_houses.head()

In [None]:
kabul_houses = houses[houses['city'].astype(str).str.contains('کا')].sort_values(by='amount')
kabul_houses.head()

In [None]:
!pip install arabic-reshaper
!pip install python-bidi

In [None]:
from matplotlib.ticker import AutoMinorLocator, MultipleLocator
import arabic_reshaper
import matplotlib.pyplot as plt
from bidi.algorithm import get_display

x = [ ]

for item in mazar_houses['area'].astype(str).values:
    x.append(get_display(arabic_reshaper.reshape(item)))

fig, ax = plt.subplots(figsize=(14, 5.7), layout='constrained')
ax.plot(x, mazar_houses['amount'].astype(str), 'bo')
ax.yaxis.set_major_locator(MultipleLocator(10))
ax.tick_params(axis='x', rotation=90, labelsize=6)
ax.set_xlabel('Areas in Mazar City')
ax.set_ylabel('House Price in (USD)')

In [None]:
mazar_houses = houses[houses['city'].astype(str).str.contains('مز')].sort_values(by=['date', 'amount'])

fig, ax = plt.subplots(figsize=(14, 5.7), layout='constrained')
ax.plot(mazar_houses['date'], mazar_houses['amount'].astype(str), 'bo')
ax.grid(True)
ax.yaxis.set_major_locator(MultipleLocator(10))
ax.xaxis.set_major_locator(MultipleLocator(2))
ax.tick_params(axis='x', rotation=90, labelsize=6)
ax.set_xlabel('Date of Announcement (sorted)')
ax.set_ylabel('House Price in (USD)')
ax.set_title('Announcement of House Price of MAZAR City')
fig.savefig('house_announce.png')

In [None]:
mazar_houses = mazar_houses.sort_values(by='date')
kabul_houses = kabul_houses.sort_values(by='date')
fig, ax = plt.subplots(figsize=(14, 5.7), layout='constrained')
ax.plot(kabul_houses['date'], kabul_houses['amount'].astype(str), 'bo')
ax.plot(mazar_houses['date'], mazar_houses['amount'].astype(str), 'ro')
#ax.grid(True)
ax.yaxis.set_major_locator(MultipleLocator(10))
ax.xaxis.set_major_locator(MultipleLocator(2))
ax.tick_params(axis='x', rotation=90, labelsize=6)
ax.legend(['Kabul', 'Mazar'])
ax.set_xlabel('Date of Announcement')
ax.set_ylabel('House Price in (USD)')
ax.set_title('Announcement of House Price of MAZAR & Kabul City')
fig.savefig('house_announce.png')
