# Web Scraping Project

### What is Web Scraping?
* Web scraping is an automatic method to obtain large amounts of data from websites. Most of this data is unstructured data in an HTML format which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications.

* Web scraping, also called web data extraction, may access the World Wide Web directly over the Hypertext Transfer Protocol (HTTP) or through a web browser. Although a person can copy data from pages by hand, the term usually refers to automated processes run by a bot or web crawler.
* The content of a page may be parsed, searched, and reformatted, and its data copied into a spreadsheet or loaded into a database. Web scrapers typically take something out of a page in order to use it for another purpose elsewhere.

![](https://uploads-ssl.webflow.com/5fd55aec9b6ceba1eec9f9fd/61932c032e0a0173df6d2377_What%20is%20Web%20Scraping.jpg)

## Objective:

* Collect the important housing features, such as **house price, total sqft area, project, location, BHK and so on**, by web scraping
* Find out the price of houses in different locations in Mumbai
* Scrape all data from the website ***https://www.makaan.com*** using various techniques
* Perform **EDA, data cleaning and data visualization on the raw data** to understand the important features and their correlation with house price

## Import Required Libraries

In [1]:
import requests
from bs4 import BeautifulSoup

## Define URL

In [2]:
url='https://www.makaan.com/mumbai-residential-property/buy-property-in-mumbai-city?page=2'

In [3]:
#Send http request to server
response=requests.get(url)

In [4]:
#check the status of the request 
response.status_code

200

* **Conclusion:**
* The status code is **200**, which means the request succeeded and the server returned the page
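
A status of 200 only says that the server answered this one request. When many pages are fetched, it can help to fail loudly on anything outside the 2xx range. The sketch below assumes a hypothetical helper, `ensure_ok`, which is not part of `requests`; it works with any object that has a `status_code` attribute, such as the `response` above.

```python
from types import SimpleNamespace


def ensure_ok(response):
    """Return the response unchanged if its status is 2xx, else raise.

    `ensure_ok` is a hypothetical helper for illustration; it only relies on
    the `status_code` attribute that `requests` responses expose.
    """
    if not 200 <= response.status_code < 300:
        raise RuntimeError(f"Request failed with status {response.status_code}")
    return response


# Stand-in objects so the example runs without network access.
ok = SimpleNamespace(status_code=200)
missing = SimpleNamespace(status_code=404)

ensure_ok(ok)  # passes silently
try:
    ensure_ok(missing)
except RuntimeError as err:
    message = str(err)
    print(message)
```

`requests` also provides `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.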

## Parse the downloaded data using BeautifulSoup

In [5]:
#parse the HTML text of the response into a BeautifulSoup object
soup = BeautifulSoup(response.text,'html.parser')
#print(soup.prettify())

In [6]:
#number of top-level nodes in the parsed document (not the size of the page)
print(len(soup))

3
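
`len(soup)` does not measure how much data was parsed. A `BeautifulSoup` object behaves like a tag, and its length is the number of direct children at the top level (typically a doctype, some whitespace, and the `<html>` element). A minimal sketch on a made-up HTML string, showing how that differs from counting tags:

```python
from bs4 import BeautifulSoup

# Made-up document for illustration; not taken from the scraped site.
html = "<!DOCTYPE html>\n<html><body><p>one</p><p>two</p></body></html>"
demo_soup = BeautifulSoup(html, "html.parser")

top_level_children = len(demo_soup)        # same as len(demo_soup.contents)
all_tags = len(demo_soup.find_all(True))   # every tag in the tree
paragraphs = len(demo_soup.find_all("p"))  # only <p> tags

print(top_level_children, all_tags, paragraphs)
```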


## Extracting the necessary data from the parsed data
* Number of BHK
* Project name
* Price of house
* Total area of house
* Location
* City
* House image

### A. Extract Data of the 1st Block (Record) of the Page

In [7]:
#get the first block (record) on the page; find() returns only the first match
main_class='infoWrap'
data=soup.find('div',{'class':main_class})

In [8]:
#print html data 
print(data)
#print(data.prettify())

<div class="infoWrap" itemprop="event" itemref="itemImageFor-19237160" itemscope="" itemtype="http://schema.org/Event"><div class="title-line-wrap"><div class="title-line"><a class="typelink" data-type="listing-link" href="https://www.makaan.com/mumbai/shree-krishna-groups-sangam-in-chembur-19237160/3bhk-984-sqft-apartment" itemprop="url" target="_blank"><meta content="3 BHK Apartment for sale" id="itemNameFor-19237160" itemprop="name"/><strong><span class="val">3 </span><span>BHK </span><span>Apartment</span></strong></a><span class="project-wrap"> in <strong><a class="projName" data-link-name="SHREE KRISHNA Sangam" data-link-type="project overview" data-track-label="19237160_1_3146965_select" data-type="projName" href="https://www.makaan.com/mumbai/shree-krishna-groups-sangam-in-chembur-3146965" target="_blank" title="Go to SHREE KRISHNA Sangam"><span>SHREE KRISHNA Sangam</span></a></strong></span><div class="rera-tag-new" title="Rera Approved Project"><img alt="Rera Approved Project

In [9]:
#number of direct children of the block (not the number of records)
print(len(data))

6


In [10]:
#Title line: BHK type, project name and location
class_='title-line-wrap'
print(data.find('div',{'class':class_}).text)

3 BHK Apartment in SHREE KRISHNA SangamChembur, Mumbai


In [11]:
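# Extract BHK and project name
# Note: the project string still carries the locality (e.g. 'SangamChembur') because .text joins nested strings without a separator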
class_1='title-line-wrap'
BHK=data.find('div',{'class':class_1}).text.split(',')[0].split(' in ')[0]
Project=data.find('div',{'class':class_1}).text.split(',')[0].split(' in ')[1]

print(BHK)
print(Project)

3 BHK Apartment
SHREE KRISHNA SangamChembur
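
The printed project name runs into the locality (`SHREE KRISHNA SangamChembur`) because `.text` joins every nested string with no separator. The HTML printed for the first block shows the project name inside an `<a class="projName">` element, so reading that element directly avoids the problem. A sketch on a trimmed copy of that markup; the trailing locality `<span>` is an assumed stand-in for the part of the block that is cut off in the printed output:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the first listing block. The final <span> is an assumption
# standing in for the truncated part of the real markup.
snippet = (
    '<div class="infoWrap"><div class="title-line-wrap"><div class="title-line">'
    '<a class="typelink"><strong><span class="val">3 </span><span>BHK </span>'
    '<span>Apartment</span></strong></a>'
    '<span class="project-wrap"> in <strong><a class="projName">'
    '<span>SHREE KRISHNA Sangam</span></a></strong></span></div>'
    '<span>Chembur, Mumbai</span></div></div>'
)
block = BeautifulSoup(snippet, "html.parser").find("div", {"class": "infoWrap"})

# Property type from the listing link, with spaces between the nested spans.
bhk_clean = block.find("a", {"class": "typelink"}).get_text(" ", strip=True)

# Project name from its own anchor, so the locality is not appended.
project_clean = block.find("a", {"class": "projName"}).get_text(strip=True)

print(bhk_clean)
print(project_clean)
```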


In [12]:
#Extract price of house from the price cell
price_class='price'
price=data.find('td',{'class':price_class}).text
print(price)

 3.05 Cr
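
Prices come back as text with Indian units: `Cr` (crore, 10,000,000) and `L` (lakh, 100,000). For later analysis they need to be numbers. A possible conversion, assuming every price has the form `<number> <unit>` as in the outputs shown in this notebook; `price_to_rupees` is a hypothetical helper name:

```python
from decimal import Decimal

# Multipliers for the units seen in the scraped prices.
UNIT_MULTIPLIERS = {"Cr": 10_000_000, "L": 100_000}


def price_to_rupees(price_text):
    """Convert text such as '3.05 Cr' or '10.75 L' to an integer rupee amount."""
    number, unit = price_text.split()
    if unit not in UNIT_MULTIPLIERS:
        raise ValueError(f"Unknown price unit: {unit!r}")
    # Decimal avoids binary floating-point rounding on values like 3.05.
    return int(Decimal(number) * UNIT_MULTIPLIERS[unit])


print(price_to_rupees(" 3.05 Cr"))
print(price_to_rupees("10.75 L"))
```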


In [13]:
#Price of house per sq feet area  (price/sq feet)
price_sqft=data.find('td',{'class':'lbl rate'}).text
print(price_sqft)

31,000 / sq ft


In [14]:
#Total area of house in sq ft (carpet area)
class_total_area='size'
total_area_house=data.find('td',{'class':class_total_area}).text
print(total_area_house)

984 


In [15]:
#Location and city
print(data.find('a',{'class':'loclink'}).text)

#location
location=data.find('a',{'class':'loclink'}).text.split(',')[0].strip()
city=data.find('a',{'class':'loclink'}).text.split(',')[1].strip()

print(location)
print(city)

Chembur, Mumbai
Chembur
Mumbai
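
Indexing `split(',')[1]` raises `IndexError` when a listing has no comma in its location text. A more forgiving split can be sketched with a hypothetical `split_location` helper that falls back to `"NA"`:

```python
def split_location(text, missing="NA"):
    """Split 'Locality, City' into (locality, city).

    If a part is absent or empty, it falls back to `missing`.
    """
    parts = [part.strip() for part in text.split(",")]
    locality = parts[0] or missing
    city = parts[1] if len(parts) > 1 and parts[1] else missing
    return locality, city


print(split_location("Chembur, Mumbai"))
print(split_location("Chembur"))
```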


In [16]:
#get images of house
img_class='imgWrap'
img_house=soup.find('div',{'class':img_class})
img_link=img_house.find('img').get('data-src')
print(img_link)

https://static.makaan.com/1/3146965/291/sangam-landscape-garden-and-tree-planting-128722923.jpeg?width=460&height=260


### B. Extract Data of All Blocks from the Page

In [17]:
#get all data of 1st page
main_class='infoWrap'
data=soup.find_all('div',{'class':main_class})

In [18]:
# Total No of blocks in single page
print(len(data))

20


* **Conclusion:**
* Each page holds **20 records (blocks)**

In [19]:
# Get all information of 12th block of 1st page
BHK=data[11].find('div',{'class':class_1}).text.split(',')[0].split(' in ')[0]
Project=data[11].find('div',{'class':class_1}).text.split(',')[0].split(' in ')[1]
price=data[11].find('td',{'class':price_class}).text.strip()
total_area_house=data[11].find('td',{'class':class_total_area}).text
location=data[11].find('a',{'class':'loclink'}).text.split(',')[0].strip()
city=data[11].find('a',{'class':'loclink'}).text.split(',')[1].strip()
img_house=soup.find_all('div',{'class':img_class})
img_link=img_house[11].find('img').get('data-src')


print("BHK -",BHK)
print("Project_By -",Project)
print("Price of house -",price)
print("Total area -",total_area_house)
print("Location -",location)
print("city -",city)
print("img link of house -",img_link)

BHK - 1 BHK Apartment
Project_By - A Plus Golden VanVangani
Price of house - 10.75 L
Total area - 417 
Location - Vangani
city - Mumbai
img link of house - https://static.makaan.com/1/1936709/291/golden-van-swimming-pool-131986548.jpeg?width=460&height=260
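
`find` returns `None` when an element is absent, so chaining `.text` onto it raises `AttributeError` for any listing that lacks, say, a price cell. One way to guard against that is a small wrapper, here called `text_or_na` (a hypothetical name), that returns a placeholder instead. The markup below is made up for the demonstration and only reuses the class names from this notebook:

```python
from bs4 import BeautifulSoup


def text_or_na(block, tag, class_name, missing="NA"):
    """Return the stripped text of the first matching element, or `missing`."""
    element = block.find(tag, {"class": class_name})
    return element.get_text(strip=True) if element is not None else missing


def parse_listing(block):
    """Collect the fields of one listing block into a dictionary."""
    return {
        "price": text_or_na(block, "td", "price"),
        "total_sqft": text_or_na(block, "td", "size"),
        "location": text_or_na(block, "a", "loclink"),
    }


# Made-up markup: the second listing has no price cell.
html = """
<div class="infoWrap"><table><tr><td class="price">10.75 L</td>
<td class="size">417</td></tr></table><a class="loclink">Vangani, Mumbai</a></div>
<div class="infoWrap"><table><tr><td class="size">500</td></tr></table>
<a class="loclink">Kurla, Mumbai</a></div>
"""
blocks = BeautifulSoup(html, "html.parser").find_all("div", {"class": "infoWrap"})
records = [parse_listing(block) for block in blocks]
print(records)
```

Collecting each record as one dictionary also keeps the fields of a listing together, which avoids separate lists drifting out of step.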


### C. Extract All Data from Multiple Pages

In [20]:
# base url for all pages
base='https://www.makaan.com/mumbai-residential-property/buy-property-in-mumbai-city?page='

* **Conclusion:**
* `base` is the part of the URL common to all pages; only the page number at the end changes

In [21]:
#list of page numbers to scrape (pages 2 to 200)
list_page_no=list(range(2,201))

In [22]:
#page numbers to be scraped
print(list_page_no)

[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200]


In [23]:
#create list to store the URLs of all pages
url_list=[]

#append each page number to the base url
for i in list_page_no:
    url=base+str(i)
    url_list.append(url)

In [24]:
#URL list
print(url_list[:5])

['https://www.makaan.com/mumbai-residential-property/buy-property-in-mumbai-city?page=2', 'https://www.makaan.com/mumbai-residential-property/buy-property-in-mumbai-city?page=3', 'https://www.makaan.com/mumbai-residential-property/buy-property-in-mumbai-city?page=4', 'https://www.makaan.com/mumbai-residential-property/buy-property-in-mumbai-city?page=5', 'https://www.makaan.com/mumbai-residential-property/buy-property-in-mumbai-city?page=6']
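
The same list can be built in one step with a list comprehension. A short sketch that reuses the base URL and page range from the cells above:

```python
base = "https://www.makaan.com/mumbai-residential-property/buy-property-in-mumbai-city?page="

# Pages 2 to 200 inclusive, matching range(2, 201).
url_list = [f"{base}{page}" for page in range(2, 201)]

print(len(url_list))
print(url_list[0])
print(url_list[-1])
```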


In [25]:
#create list for each feature
bhk_list=[]
project_list=[]
price_list=[]
price_sqf_list=[]
total_sqf_list=[]
city_list=[]
location_list=[]

#url_list already holds every page (2 to 200), so no slicing is needed
for indx,url in enumerate(url_list):
    response=requests.get(url)
    print("page scraped",indx, end="  ")
    soup = BeautifulSoup(response.text,'html.parser')
    page_data = soup.find_all('div',{'class':'infoWrap'})
    
    for i in page_data:
        #BHK and project
        title=i.find('div',{'class':'title-line-wrap'}).text.split(',')[0]
        if ' in ' in title:
            parts=title.split(' in ')
            bhk=parts[0]
            project=parts[1]
        else:
            #no project in the title: keep the whole title (not just its first character)
            bhk=title.strip()
            project="NA"
        bhk_list.append(bhk)
        project_list.append(project)
        
        #Price
        price=i.find('td',{'class':'price'}).text.strip()
        price_list.append(price)
        
        #Price per sq ft: keep only the number before the '/'
        psqf=i.find('td',{'class':'lbl rate'}).text.split('/')[0].strip()
        price_sqf_list.append(psqf)
        
        #Total area in sq ft
        total_sft_area=i.find('td',{'class':'size'}).text.strip()
        total_sqf_list.append(total_sft_area)
        
        #Location and city
        area=i.find('a',{'class':'loclink'}).text.split(',')
        location=area[0].strip()
        city=area[1].strip()
        location_list.append(location)
        city_list.append(city)

page scraped 0  page scraped 1  page scraped 2  page scraped 3  page scraped 4  page scraped 5  page scraped 6  page scraped 7  page scraped 8  page scraped 9  page scraped 10  page scraped 11  page scraped 12  page scraped 13  page scraped 14  page scraped 15  page scraped 16  page scraped 17  page scraped 18  page scraped 19  page scraped 20  page scraped 21  page scraped 22  page scraped 23  page scraped 24  page scraped 25  page scraped 26  page scraped 27  page scraped 28  page scraped 29  page scraped 30  page scraped 31  page scraped 32  page scraped 33  page scraped 34  page scraped 35  page scraped 36  page scraped 37  page scraped 38  page scraped 39  page scraped 40  page scraped 41  page scraped 42  page scraped 43  page scraped 44  page scraped 45  page scraped 46  page scraped 47  page scraped 48  page scraped 49  page scraped 50  page scraped 51  page scraped 52  page scraped 53  page scraped 54  page scraped 55  page scraped 56  page scraped 57  page scraped 58  page sc
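
Fetching about 200 pages in a tight loop can strain the server, and the loop stops at the first network error. The sketch below pauses between requests and records failures instead of crashing. The `fetch` argument is any function that takes a URL and returns HTML text; `scrape_pages`, `fake_fetch`, the example URLs and the delay value are illustrative assumptions, not part of the original notebook.

```python
import time


def scrape_pages(urls, fetch, delay_seconds=1.0):
    """Fetch each URL in turn, pausing between requests.

    Returns (pages, failed): a dict of url -> HTML text and a list of
    (url, error message) pairs for requests that raised an exception.
    """
    pages, failed = {}, []
    for index, url in enumerate(urls):
        try:
            pages[url] = fetch(url)
        except Exception as err:  # keep going; report the failure afterwards
            failed.append((url, str(err)))
        if index < len(urls) - 1:
            time.sleep(delay_seconds)  # be polite to the server
    return pages, failed


def fake_fetch(url):
    """Stand-in for a real download so the example runs offline."""
    if url.endswith("page=3"):
        raise ConnectionError("simulated timeout")
    return f"<html>{url}</html>"


demo_urls = [f"https://example.com/list?page={n}" for n in (2, 3, 4)]
pages, failed = scrape_pages(demo_urls, fake_fetch, delay_seconds=0)
print(len(pages), failed)
```

With `requests`, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`, and the failed URLs can be retried afterwards.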

### Conclusion
* Scraped **199 pages** (pages 2 to 200) successfully

In [26]:
# Total number of records in each list (all lengths must match)
print(len(bhk_list))
print(len(price_list))
print(len(project_list))
print(len(price_sqf_list))
print(len(total_sqf_list))
print(len(location_list))
print(len(city_list))

3980
3980
3980
3980
3980
3980
3980


In [27]:
#import library
import pandas as pd
import numpy as np

In [28]:
#create dict of column name -> list of values (named data_dict to avoid shadowing the built-in dict)
data_dict = { 'BHK':bhk_list,'project':project_list,'Location':location_list,'City':city_list,
        'Total sqft':total_sqf_list,'price_sqft':price_sqf_list,'price':price_list}

* Create a DataFrame to store all the information
* Image links are not stored in the DataFrame

In [29]:
#create dataframe which holds all information
df=pd.DataFrame(data_dict)

In [30]:
#shape of dataframe
df.shape

(3980, 7)

In [31]:
#Top 10 records
df.head(10)

| | BHK | project | Location | City | Total sqft | price_sqft | price |
|---|---|---|---|---|---|---|---|
| 0 | 3 BHK Apartment | SHREE KRISHNA SangamChembur | Chembur | Mumbai | 984 | 31000 | 3.05 Cr |
| 1 | 2 BHK Apartment | Ekdanta 24 KaratKurla | Kurla | Mumbai | 598 | 23913 | 1.42 Cr |
| 2 | 2 BHK Apartment | Liberty Bay VueMalad West | Malad West | Mumbai | 738 | 21000 | 1.54 Cr |
| 3 | 3 BHK Apartment | Thalia Vrindavan FloraRasayani | Rasayani | Mumbai | 644 | 10676 | 68.75 L |
| 4 | 2 BHK Apartment | Mayfair The ViewVikhroli | Vikhroli | Mumbai | 582 | 24914 | 1.45 Cr |
| 5 | 2 BHK Apartment | Puraniks City Sector 1Neral | Neral | Mumbai | 427 | 5756 | 24.58 L |
| 6 | 3 BHK Apartment | Jewel CrestMahim | Mahim | Mumbai | 1130 | 42477 | 4.8 Cr |
| 7 | 2 BHK Apartment | Aplite Greenstone HeritageFort | Fort | Mumbai | 671 | 40536 | 2.72 Cr |
| 8 | 3 BHK Apartment | Mahaavir PrideDombivali | Dombivali | Mumbai | 917 | 10359 | 95 L |
| 9 | 1 BHK Independent House | VBHC 47 Rowland ParkPalghar | Palghar | Mumbai | 701 | 3894 | 27.3 L |


In [32]:
#Shape of Dataframe
df.shape

(3980, 7)

In [33]:
#total unique projects
len(df['project'].unique())

968

In [34]:
#save csv file
df.to_csv('house_price_mumbai.csv',index=False)

In [35]:
#drop exact duplicate rows and check the new shape
df1=df.drop_duplicates()
df1.shape

(2875, 7)
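
Note that the CSV above was written before duplicates were dropped, so the file still holds all 3980 rows. If the deduplicated data is what the next stage should use, it can be saved separately. A sketch on a tiny made-up frame; the output file name mentioned in the comment is an assumption:

```python
import io

import pandas as pd

# Tiny made-up frame with one repeated row.
demo = pd.DataFrame(
    {
        "Location": ["Chembur", "Kurla", "Chembur"],
        "price": ["3.05 Cr", "1.42 Cr", "3.05 Cr"],
    }
)

deduped = demo.drop_duplicates().reset_index(drop=True)
print(demo.shape, deduped.shape)

# Written to an in-memory buffer here; a real run would pass a file name
# such as 'house_price_mumbai_dedup.csv' (hypothetical name).
buffer = io.StringIO()
deduped.to_csv(buffer, index=False)
```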

In [36]:
#total unique locations
len(df['Location'].unique())

174

In [37]:
df['Location'].unique()

array(['Chembur', 'Kurla', 'Malad West', 'Rasayani', 'Vikhroli', 'Neral',
       'Mahim', 'Fort', 'Dombivali', 'Palghar', 'Malad East', 'Vangani',
       'Jogeshwari West', 'Dahisar', 'Borivali East', 'Mulund West',
       'Kharghar', 'Vasai', 'Santacruz East', 'Ambernath East',
       'Andheri West', 'Mazagaon', 'Karjat', 'Khopoli', 'Goregaon East',
       'Panvel', 'Kalyan West', 'Andheri East', 'Dronagiri', 'Umroli',
       'Virar', 'Kalyan East', 'Thane West', 'Shahapur', 'Taloja', 'Ulwe',
       'Badlapur East', 'Titwala', 'Diva', 'Kamothe', 'Bhiwandi',
       'Kandivali West', 'Powai', 'Badlapur West',
       'kasaradavali thane west', 'Dombivali East', 'Mira Road East',
       'Vasai east', 'Wada', 'Dadar East', 'Colaba', 'Sion',
       'Santosh Nagar', 'Virar East', 'Agripada', 'Gorai', 'Virar West',
       'Santacruz West', 'Kalwa', 'Nala Sopara', 'Hendre Pada',
       'Vasai West', 'Mazgaon', 'Belapur', 'Borivali West', 'Nerul',
       'Kumbharkhan Pada', 'Ghatkopar West', 'K
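
The list of localities mixes casing styles, for example `'Vasai east'` and `'kasaradavali thane west'` among title-cased names. Normalising whitespace and case before counting unique values removes these purely cosmetic differences. A minimal sketch with a hypothetical `normalize_location` helper; true spelling variants such as `'Mazagaon'` and `'Mazgaon'` would still need a manual mapping:

```python
def normalize_location(name):
    """Collapse repeated whitespace and apply title case."""
    return " ".join(name.split()).title()


# Sample values; 'Vasai East' and the extra spaces are added for illustration.
raw = ["Vasai east", "Vasai East", " kasaradavali  thane west", "Mazgaon"]
cleaned = sorted({normalize_location(name) for name in raw})
print(cleaned)
```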

* Note: The first objective of the project, web scraping, is now complete. The next step is data cleaning and visualization.