# Google Play Scraping
## Get Information from an Application

### Kadek Dwi Budi Utama
### https://github.com/kadekutama

2016

### URL Initialization and Getting Access to Google Play
In this tutorial, we will scrape the Google Play page of a game called Does not Commute.

In [1]:
import urllib3 as u3

url = "https://play.google.com/store/apps/details?id=com.mediocre.commute&hl=en"
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:49.0) Gecko/20100101 Firefox/49.0'}

u3.disable_warnings()
http = u3.PoolManager(10, headers=header)
req = http.urlopen('GET', url)
print(req)
page = req.data
print(page[1:1000])

<urllib3.response.HTTPResponse object at 0x00000197E5D05208>
b'!doctype html>\n<html xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#" lang="en"><head><script>(function(){window.latencyTrackerTimes={clientSideStartMs:Date.now()};})();</script><script>(function(){function _DumpException(b){window.console.error(b.stack)};var f=this,m=Date.now||function(){return+new Date};function aa(b,d){var a=["LOWLIFE_wizbind"],c=d||f;a[0]in c||!c.execScript||c.execScript("var "+a[0]);for(var e;a.length&&(e=a.shift());)a.length||void 0===b?c[e]?c=c[e]:c=c[e]={}:c[e]=b};function ba(b,d){if(null===d)return!1;if("contains"in b&&1==d.nodeType)return b.contains(d);if("compareDocumentPosition"in b)return b==d||!!(b.compareDocumentPosition(d)&16);for(;d&&b!=d;)d=d.parentNode;return d==b};var v={};function ca(b,d){return function(a){a||(a=window.event);return d.call(b,a)}}function y(b){b=b.target||b.srcElement;!b.getAttribute&&b.parentNode&&(b=b.parentNode);return b}var C="undefined"!=typeof na
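
The URL in the cell above is written out by hand. As a small sketch, the same address could be assembled from the package ID with the standard library. The helper name `build_details_url` and its parameters are illustrative and not part of the original notebook.

```python
from urllib.parse import urlencode

# Base address of an app details page, taken from the URL used above.
DETAILS_BASE = "https://play.google.com/store/apps/details"

def build_details_url(app_id, lang="en"):
    """Return the details-page URL for a package ID (hypothetical helper)."""
    query = urlencode({"id": app_id, "hl": lang})
    return DETAILS_BASE + "?" + query

print(build_details_url("com.mediocre.commute"))
```

Building the query string with `urlencode` keeps the package ID and language code separate, which helps when scraping more than one app.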

### Getting HTML File from the URL
We use the BeautifulSoup library to parse the downloaded page into a searchable HTML tree.

In [2]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page, "lxml")
print(type(soup))
print(soup.prettify()[1:1000])

<class 'bs4.BeautifulSoup'>
!DOCTYPE html>
<html lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#">
 <head>
  <script>
   (function(){window.latencyTrackerTimes={clientSideStartMs:Date.now()};})();
  </script>
  <script>
   (function(){function _DumpException(b){window.console.error(b.stack)};var f=this,m=Date.now||function(){return+new Date};function aa(b,d){var a=["LOWLIFE_wizbind"],c=d||f;a[0]in c||!c.execScript||c.execScript("var "+a[0]);for(var e;a.length&&(e=a.shift());)a.length||void 0===b?c[e]?c=c[e]:c=c[e]={}:c[e]=b};function ba(b,d){if(null===d)return!1;if("contains"in b&&1==d.nodeType)return b.contains(d);if("compareDocumentPosition"in b)return b==d||!!(b.compareDocumentPosition(d)&16);for(;d&&b!=d;)d=d.parentNode;return d==b};var v={};function ca(b,d){return function(a){a||(a=window.event);return d.call(b,a)}}function y(b){b=b.target||b.srcElement;!b.getAttribute&&b.parentNode&&(b=b.parentNode);return b}var C="undefined"!=typeof navigator&&/Macinto

### Getting App Name
According to the element inspector, the app name is located in a "div" with the class "id-app-title". In Python 3, the "text" attribute already returns a Unicode string, so no explicit "utf-8" encoding step is needed.

In [5]:
obj = soup.find("div","id-app-title")
name = obj.text
print("%s\n%s" % (obj,name))

<div class="id-app-title" tabindex="0">Does not Commute</div>
Does not Commute
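
Every lookup in this notebook calls `soup.find(...)` and then reads `.text`. When no tag matches, `find` returns `None` and the attribute access raises an `AttributeError`. A minimal sketch of a guard follows, assuming a hypothetical helper named `find_text` and a tiny stand-in document parsed with the built-in "html.parser" instead of the real page.

```python
from bs4 import BeautifulSoup

def find_text(soup, name, attrs, default=None):
    """Return the stripped text of the first matching tag, or `default` if none matches."""
    tag = soup.find(name, attrs)
    if tag is None:
        return default
    return tag.get_text(strip=True)

# Small stand-in document; the real page is far larger.
html = '<div class="id-app-title" tabindex="0">Does not Commute</div>'
demo = BeautifulSoup(html, "html.parser")

print(find_text(demo, "div", {"class": "id-app-title"}))
print(find_text(demo, "span", {"itemprop": "genre"}, default="unknown"))
```

The second call shows the fallback: the stand-in document has no genre element, so the default value is returned instead of an exception.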


### Getting App Genre
According to the element inspector, the app genre is located in a "span" with the attribute "itemprop="genre"".

In [6]:
obj = soup.find("span",{"itemprop":"genre"})
genre = obj.text
print("%s\n%s" % (obj,genre))

<span itemprop="genre">Racing</span>
Racing


### Getting App Badges
According to the element inspector, the app badges are located in "span" elements with the class "badge-title". An app may have several badges, so we need to iterate over them.

In [8]:
obj = soup.find_all("span",{"class":"badge-title"})
print(obj)
badges = []
for row in obj:
    b = row.find(text=True)
    badges.append(b)
print(badges)

[<span class="badge-title">Editors' Choice</span>, <span class="badge-title">Top Developer</span>]
["Editors' Choice", 'Top Developer']


### Getting App Rating
According to the element inspector, the app rating count (the number of ratings) is located in a "span" with the class "rating-count".

In [9]:
obj = soup.find("span",{"class":"rating-count"})
rating = obj.text
print("%s\n%s" % (obj,rating))

<span aria-label=" 143,071 ratings " class="rating-count">143,071</span>
143,071


### Getting App Score
According to the element inspector, the app score is located in a "div" with the class "score".

In [10]:
obj = soup.find("div",{"class":"score"})
score = obj.text
print("%s\n%s" % (obj,score))

<div aria-label=" Rated 3.9 stars out of five stars " class="score">3.9</div>
3.9
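
The rating count and the score are both scraped as strings. A short sketch of converting them to numbers follows, assuming the English page ("hl=en"), where a comma is the thousands separator. The helper names are illustrative.

```python
def parse_count(text):
    """Convert a count such as '143,071' to an int (assumes comma thousands separators)."""
    return int(text.strip().replace(",", ""))

def parse_score(text):
    """Convert a score such as '3.9' to a float."""
    return float(text.strip())

print(parse_count("143,071"))
print(parse_score("3.9"))
```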


### Getting App Developer Name
According to the element inspector, the app developer name is located in a "span" with the attribute "itemprop="name"".

In [11]:
obj = soup.find("span",{"itemprop":"name"})
developer = obj.text
print("%s\n%s" % (obj,developer))

<span itemprop="name">Mediocre</span>
Mediocre


### Getting App Version
According to the element inspector, the app version is located in a "div" with the attribute "itemprop="softwareVersion"".

In [12]:
obj = soup.find("div",{"itemprop":"softwareVersion"})
version = obj.text
print("%s\n%s" % (obj,version))

<div class="content" itemprop="softwareVersion"> 1.4.2  </div>
 1.4.2  
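
The scraped version string carries leading and trailing whitespace. A small sketch of cleaning it up and splitting it into integers for comparison follows. It assumes a purely numeric dotted version, and the helper name is illustrative. A non-numeric value would raise a `ValueError`.

```python
def parse_version(text):
    """Strip whitespace and split a dotted version such as ' 1.4.2  ' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))

print(parse_version(" 1.4.2  "))
```

Tuples compare element by element, so `(1, 4, 2) < (1, 10, 0)` holds, whereas comparing the raw strings would order "1.10.0" before "1.4.2".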


### Getting App Content Rating
According to the element inspector, the app content rating is located in a "div" with the attribute "itemprop="contentRating"".

In [13]:
obj = soup.find("div",{"itemprop":"contentRating"})
contentRating = obj.text
print("%s\n%s" % (obj,contentRating))

<div class="content" itemprop="contentRating">Rated for 3+</div>
Rated for 3+


### Getting App Download Number
According to the element inspector, the app download number is located in a "div" with the attribute "itemprop="numDownloads"".

In [14]:
obj = soup.find("div",{"itemprop":"numDownloads"})
downloadNumber = obj.text
print("%s\n%s" % (obj,downloadNumber))

<div class="content" itemprop="numDownloads">  5,000,000 - 10,000,000  </div>
  5,000,000 - 10,000,000  
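
The download number is reported as a padded range string. As a sketch, it can be split into a lower and upper bound, assuming the "low - high" layout shown above with comma thousands separators. The helper name is illustrative.

```python
def parse_download_range(text):
    """Split a range such as '  5,000,000 - 10,000,000  ' into a (low, high) tuple of ints."""
    low, high = (int(part.strip().replace(",", "")) for part in text.split("-"))
    return low, high

print(parse_download_range("  5,000,000 - 10,000,000  "))
```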


### Getting App Description
According to the element inspector, the app description is located in a "div" with the attribute "jsname="C4s9Ed"".

In [15]:
obj = soup.find("div",{"jsname":"C4s9Ed"})
description = obj.text
print(description)

A strategic driving game from the award-winning maker’s of Smash Hit. Does not Commute is a temporal paradox in which you have no one to blame but yourself. What starts out as a relaxing commute in a small town of the 1970's quickly devolves into traffic chaos with hot dog trucks, sports cars, school buses and dozens of other vehicles. You drive them all. Plan ahead. Don't be late.In this small town, discover the characters and their secrets – what world-changing experiment is inventive dentist Dr Charles Schneider hiding? Will Mr Baker quit his job in advertising? What is that strange mask on Mrs Griffin's face? Will Mr Mayfield’s peculiar obsession with Yorkshire Terriers take over his life?Does Not Commute is playable at no cost and free from ads.  An optional premium upgrade is available through a one-time in-app purchase that will enable the ability to continue from checkpoints.
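
In the output above, paragraphs run together ("Don't be late.In this small town"), because `.text` concatenates the text of nested elements without a separator. A minimal sketch of keeping the breaks with `get_text` follows. The stand-in markup is an assumption, since the exact structure of the real description block is not shown in this notebook.

```python
from bs4 import BeautifulSoup

# Stand-in markup; the paragraph structure of the real description is an assumption.
html = ('<div jsname="C4s9Ed">Plan ahead. Don\'t be late.'
        '<p>In this small town, discover the characters.</p></div>')
block = BeautifulSoup(html, "html.parser").find("div", {"jsname": "C4s9Ed"})

joined = block.text                            # nested text concatenated directly
separated = block.get_text("\n", strip=True)   # one text fragment per line

print(joined)
print(separated)
```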


### Getting App Screenshots Link
According to the element inspector, the app screenshot links are located in a "div" with the class "thumbnails". An app may have several screenshots, so we need to iterate over them.
All of the links start with "//", so we need to replace that prefix with "http://".

In [39]:
obj = soup.find("div",{"class":"thumbnails"})
#print(obj)
screenshotsLink = []

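for row in obj.find_all("img"):
    src = row["src"]
    # Only prefix links that start with "//"; a blanket replace would also change any later "//"
    if src.startswith("//"):
        src = "http:" + src
    screenshotsLink.append(src)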
    
[link for link in screenshotsLink]

['http://i.ytimg.com/vi/oXx6khzN6I0/hqdefault.jpg',
 'http://lh3.googleusercontent.com/jAP_Q4Q5Lwf7RqI__8NtxWYq9gfvuW0ge8swDT-5xNtZ8Z1cIHuJqBHza_ZrrNKVSA=h310',
 'http://lh3.googleusercontent.com/pbPDXPfitMX8NtSpXBMs12C-TN0Mqj6rzo6jmkAvhfkK1QhdKBCUbbCK0qKm86iciQ=h310',
 'http://lh3.googleusercontent.com/3CeZjRQN_Yc7anf6jelQeFN7iKTnaO2FXMPNDQDI8bmr6uwa7qYuXzQwfh29VfI_4Q=h310',
 'http://lh3.googleusercontent.com/5S7_Crw0V_i69uVC_C14t_gGeDioiiQAVYIjVp7NkkAjwJgMaw_YWridWYSN7sKbDZbX=h310',
 'http://lh3.googleusercontent.com/y6a96-8Qd9vdJPcbGJUhDYzJxq-TZIlxj-sQl0tuwgXRC_y-gTP-mH0TphttvHCFFw=h310',
 'http://lh3.googleusercontent.com/PuD4dijjDZ0YLxX0zBIGGgSUAkEtSuLRuNQhFK1bZ0b3Qivj1Ra2wnQ5C5BTS1AHC2w=h310',
 'http://lh3.googleusercontent.com/6cm2s-WsyXa6usJpshzZllxT_nu5-phSL4Eiwv58AjNaahVun7_ZpMl6hUOotdDsp3o=h310',
 'http://lh3.googleusercontent.com/g9I7hwhvZXA5X5NOdvPyf4NcXAIUp2VzksBg7zCNCSSnVdgeHn5iAr_hGNFixxeAqj-q=h310',
 'http://lh3.googleusercontent.com/3gokSAjUl2if8Q_0YmW-bxEflPmdhSi7XRl
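
Links that start with "//" are protocol-relative: they inherit the scheme of the page they appear on. As an alternative sketch, the standard library's `urljoin` resolves them against the page URL. Note that this yields "https://" links (the scheme of the page) instead of the "http://" links produced above. The helper name is illustrative.

```python
from urllib.parse import urljoin

# The page the links were scraped from (same URL as at the start of the notebook).
PAGE_URL = "https://play.google.com/store/apps/details?id=com.mediocre.commute&hl=en"

def absolute_url(src, base=PAGE_URL):
    """Resolve a protocol-relative or relative link against the page URL."""
    return urljoin(base, src)

print(absolute_url("//i.ytimg.com/vi/oXx6khzN6I0/hqdefault.jpg"))
```

Links that are already absolute pass through unchanged, so the helper can be applied to every "src" value without checking its prefix first.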

### Saving App Screenshots
There are two things to decide when saving the screenshots: the download folder and the file name. For example, I'll set the folder to "DoesNotCommute" and the file name to "DoesNotCommuteX.png" (X is an index number).

First, we check whether the folder "DoesNotCommute" exists in the current directory (using os.path.exists(folder)). If it doesn't, we create it.

Second, we iterate over all the screenshot links and request each one from its server.

Finally, we save each response as an image file in the folder "DoesNotCommute".
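
The folder and file-naming steps described above can also be sketched as a reusable function that is independent of the network request. The function name `save_images` is hypothetical, and the demonstration writes placeholder bytes into a temporary directory instead of real image data.

```python
import os
import tempfile

def save_images(blobs, folder, base_name, ext=".png"):
    """Write each bytes object to folder/base_name<index>ext and return the paths."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    paths = []
    for index, data in enumerate(blobs, start=1):
        path = os.path.join(folder, base_name + str(index) + ext)
        with open(path, "wb") as img:
            img.write(data)
        paths.append(path)
    return paths

# Demonstration with placeholder bytes instead of downloaded image data.
demo_folder = os.path.join(tempfile.mkdtemp(), "DoesNotCommute")
saved = save_images([b"first", b"second"], demo_folder, "DoesNotCommute")
print([os.path.basename(p) for p in saved])
```

Passing `exist_ok=True` to `os.makedirs` folds the existence check and the creation into one call, and `enumerate(..., start=1)` replaces the manual index counter.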

In [43]:
import os

folder = "DoesNotCommute"
# Use a separate variable for the file name so the scraped app name in "name" is not overwritten
fileName = "DoesNotCommute"
index = 1
imgFormat = ".png"

if not os.path.exists(folder):
    os.makedirs(folder)

for link in screenshotsLink:
    req = http.urlopen('GET',link)
    with open(os.path.join(folder, fileName+str(index)+imgFormat),'wb') as img:
        img.write(req.data)
    index += 1

### Summary:

In [41]:
print("Name\t\t: %s" % name)
print("Genre\t\t: %s" % genre)
print("Badges\t\t:")
for b in badges:
    print("\t\t  %s" % b)
print("Rating\t\t: %s" % rating)
print("Score\t\t: %s" % score)
print("Developer name\t: %s" % developer)
print("Version\t\t: %s" % version)
print("Content rating\t: %s" % contentRating)
print("Download number\t: %s" % downloadNumber)
print("Description\t: \n%s" % description)
print("Screenshots Link:")
[s for s in screenshotsLink]

Name		: Does not Commute
Genre		: Racing
Badges		:
		  Editors' Choice
		  Top Developer
Rating		: 143,071
Score		: 3.9
Developer name	: Mediocre
Version		:  1.4.2  
Content rating	: Rated for 3+
Download number	:   5,000,000 - 10,000,000  
Description	: 
A strategic driving game from the award-winning maker’s of Smash Hit. Does not Commute is a temporal paradox in which you have no one to blame but yourself. What starts out as a relaxing commute in a small town of the 1970's quickly devolves into traffic chaos with hot dog trucks, sports cars, school buses and dozens of other vehicles. You drive them all. Plan ahead. Don't be late.In this small town, discover the characters and their secrets – what world-changing experiment is inventive dentist Dr Charles Schneider hiding? Will Mr Baker quit his job in advertising? What is that strange mask on Mrs Griffin's face? Will Mr Mayfield’s peculiar obsession with Yorkshire Terriers take over his life?Does Not Commute is playable at no cost and 

['http://i.ytimg.com/vi/oXx6khzN6I0/hqdefault.jpg',
 'http://lh3.googleusercontent.com/jAP_Q4Q5Lwf7RqI__8NtxWYq9gfvuW0ge8swDT-5xNtZ8Z1cIHuJqBHza_ZrrNKVSA=h310',
 'http://lh3.googleusercontent.com/pbPDXPfitMX8NtSpXBMs12C-TN0Mqj6rzo6jmkAvhfkK1QhdKBCUbbCK0qKm86iciQ=h310',
 'http://lh3.googleusercontent.com/3CeZjRQN_Yc7anf6jelQeFN7iKTnaO2FXMPNDQDI8bmr6uwa7qYuXzQwfh29VfI_4Q=h310',
 'http://lh3.googleusercontent.com/5S7_Crw0V_i69uVC_C14t_gGeDioiiQAVYIjVp7NkkAjwJgMaw_YWridWYSN7sKbDZbX=h310',
 'http://lh3.googleusercontent.com/y6a96-8Qd9vdJPcbGJUhDYzJxq-TZIlxj-sQl0tuwgXRC_y-gTP-mH0TphttvHCFFw=h310',
 'http://lh3.googleusercontent.com/PuD4dijjDZ0YLxX0zBIGGgSUAkEtSuLRuNQhFK1bZ0b3Qivj1Ra2wnQ5C5BTS1AHC2w=h310',
 'http://lh3.googleusercontent.com/6cm2s-WsyXa6usJpshzZllxT_nu5-phSL4Eiwv58AjNaahVun7_ZpMl6hUOotdDsp3o=h310',
 'http://lh3.googleusercontent.com/g9I7hwhvZXA5X5NOdvPyf4NcXAIUp2VzksBg7zCNCSSnVdgeHn5iAr_hGNFixxeAqj-q=h310',
 'http://lh3.googleusercontent.com/3gokSAjUl2if8Q_0YmW-bxEflPmdhSi7XRl
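
The summary above prints the scraped fields one by one. As a closing sketch, the same values could be collected into a dictionary and serialized as JSON for later use. The key names are illustrative, the values are copied from the outputs above, and the description and screenshot links are left out for brevity.

```python
import json

# Values copied from the notebook outputs; key names are an assumption.
app = {
    "name": "Does not Commute",
    "genre": "Racing",
    "badges": ["Editors' Choice", "Top Developer"],
    "rating_count": "143,071",
    "score": "3.9",
    "developer": "Mediocre",
    "version": " 1.4.2  ".strip(),
    "content_rating": "Rated for 3+",
    "downloads": "  5,000,000 - 10,000,000  ".strip(),
}

encoded = json.dumps(app, indent=2, ensure_ascii=False)
print(encoded)
```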