
## **Predicting Yerevan apartment prices using Linear Regression**

In [0]:
import pandas as pd
import numpy as np
from geopy.distance import geodesic
from sklearn.preprocessing import LabelEncoder

In [4]:
# Read the CSV file into a pandas DataFrame.
file_path = 'https://raw.githubusercontent.com/nairababayan/linear_regression/master/yerevan_apartments.csv'
df = pd.read_csv(file_path)

# Print the first 5 rows to see the data we are dealing with.
# The latitude and longitude columns (the latter is spelled 'longtitude' in the data)
# were added to the original data, based on the street address.
df.head(5)

Unnamed: 0.1,Unnamed: 0,price,condition,district,max_floor,street,latitude,longtitude,num_rooms,area,num_bathrooms,building_type,floor,ceiling_height
0,0,65000,good,Center,9,Vardanants St,40.174314,44.521089,3,80,1,panel,4,2.8
1,1,140000,newly repaired,Arabkir,10,Hr.Kochar St,40.201211,44.501362,4,115,1,monolit,2,3.0
2,2,97000,newly repaired,Center,10,Teryan St,40.186158,44.518502,2,72,1,panel,3,2.8
3,3,47000,good,Center,9,D. Demirchyan St,40.189362,44.507579,1,43,1,panel,9,2.8
4,4,51000,newly repaired,Center,14,Sayat Nova Ave,40.182024,44.521,1,33,1,other,4,2.8


Apartment prices depend on the distance from the city center, so we add another column named 'distance' to our dataframe; it will be the main pricing feature. As the city center I took Northern Avenue, which has the highest prices.

![Yerevan city center](https://raw.githubusercontent.com/nairababayan/linear_regression/master/yerevan_city_center.png)

In [15]:
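# Compute the distance in meters with the GeoPy library (https://geopy.readthedocs.io/en/stable/).
city_center = (40.184148, 44.514906)  # (latitude, longitude) of the chosen city center
df['distance'] = [geodesic(city_center, (lat, lon)).meters
                  for lat, lon in zip(df['latitude'], df['longtitude'])]
df['distance'] = df['distance'].round(0)

# 'condition' and 'building_type' already appear as integers below because
# the encoding cell (In [14]) was executed before this one.
df.head(5)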

Unnamed: 0.1,Unnamed: 0,price,condition,district,max_floor,street,latitude,longtitude,num_rooms,area,num_bathrooms,building_type,floor,ceiling_height,distance
0,0,65000,0,Center,9,Vardanants St,40.174314,44.521089,3,80,1,2,4,2.8,1212.0
1,1,140000,1,Arabkir,10,Hr.Kochar St,40.201211,44.501362,4,115,1,0,2,3.0,2218.0
2,2,97000,1,Center,10,Teryan St,40.186158,44.518502,2,72,1,2,3,2.8,379.0
3,3,47000,0,Center,9,D. Demirchyan St,40.189362,44.507579,1,43,1,2,9,2.8,851.0
4,4,51000,1,Center,14,Sayat Nova Ave,40.182024,44.521,1,33,1,1,4,2.8,570.0
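
As a rough cross-check of the distances above, here is a minimal sketch of the haversine formula using only the standard library. It is not what the notebook uses: `geodesic` works on an ellipsoid, while this sketch assumes a spherical Earth with a mean radius of 6,371 km, so the two can differ by a fraction of a percent. The names `haversine_m` and `EARTH_RADIUS_M` are hypothetical helpers introduced for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumption: mean Earth radius, spherical approximation


def haversine_m(p1, p2):
    """Great-circle distance in meters between two (latitude, longitude) points given in degrees."""
    lat1, lon1 = map(math.radians, p1)
    lat2, lon2 = map(math.radians, p2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine of the central angle between the two points
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


city_center = (40.184148, 44.514906)  # same reference point as in the notebook
teryan_st = (40.186158, 44.518502)    # coordinates of row 2 in the table

print(round(haversine_m(city_center, teryan_st)))
```

For the Teryan St row, the result should land within a few meters of the 379 m that GeoPy reports in the table.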


In [9]:
# The full summary of the dataframe shows four object columns; two of them, 'building_type' and 'condition', are an important part of pricing.
# Our ML algorithm can only work with numerical values, so we should encode these categorical features as numbers.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6234 entries, 0 to 6233
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      6234 non-null   int64  
 1   price           6234 non-null   int64  
 2   condition       6234 non-null   object 
 3   district        6234 non-null   object 
 4   max_floor       6234 non-null   int64  
 5   street          6234 non-null   object 
 6   latitude        6234 non-null   float64
 7   longtitude      6234 non-null   float64
 8   num_rooms       6234 non-null   int64  
 9   area            6234 non-null   int64  
 10  num_bathrooms   6234 non-null   int64  
 11  building_type   6234 non-null   object 
 12  floor           6234 non-null   int64  
 13  ceiling_height  6234 non-null   float64
 14  distance        6234 non-null   float64
dtypes: float64(4), int64(7), object(4)
memory usage: 730.7+ KB


In [14]:
# We can use LabelEncoder, which converts each class of the specified feature to an integer code.
# cat_columns = df.select_dtypes(['object']).columns
cat_columns = ['condition', 'building_type']   # encode only these two columns, not every object column

le = LabelEncoder()
df[cat_columns] = df[cat_columns].apply(lambda col: le.fit_transform(col))

print(df.condition.unique())
print(df.building_type.unique())

[0 1 2]
[2 0 1 3]
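
To make the codes above easier to read, here is a minimal pure-Python sketch of the mapping that `LabelEncoder` produces: it numbers the distinct classes in sorted order. The helper `label_encode` is hypothetical and only mimics that behavior; the sample values are the `building_type` entries of the first five rows.

```python
def label_encode(values):
    """Map each distinct value to an integer code, numbering the classes in sorted order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping


# building_type values of the first five rows shown earlier
building_types = ['panel', 'monolit', 'panel', 'panel', 'other']
codes, mapping = label_encode(building_types)

print(mapping)  # {'monolit': 0, 'other': 1, 'panel': 2}
print(codes)    # [2, 0, 2, 2, 1]
```

These codes match the `building_type` column in the second table. Note that a linear model treats such codes as ordered magnitudes, so one-hot encoding (for example with `pd.get_dummies`) is a common alternative when the classes have no natural order.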
