jianhuashao edited this page Sep 17, 2012 · 5 revisions

Welcome to the AndroidAppsCollector wiki!

This project scrapes information about apps in the Android Market. It is part of JianhuaShao's PhD research data collection (Valuation of User Data).

File Structure:

  1. cate_read_**.py: collects a list of app_ids.
  2. app_read_**.py: fetches the details of each app, given its app_id.
  3. db_**.py: sqlite3 database file for the list of apps.

CMD:

  1. python cate_read_**.py
  2. python app_read_**.py ## needs to merge the db first
  3. python main.py

Source of app_id directory:

  1. https://play.google.com/store/apps: it lists only the top 504 apps in each category. However, given an app_id, it returns the details of that app.
  2. http://www.androidzoom.com: it provides a nearly complete list of apps for each category. The app_id has to be extracted by visiting each app's page on the site. Some app_ids are out of date.
  3. http://www.androlib.com: it provides a nearly complete list of apps for each category. It also specifies the distribution of apps by language. The app_id has to be extracted by visiting each app's page. Some app_ids are out of date.

Experience with the sleep interval between HTTP connections:

  1. play.google.com: 1
  2. www.androidzoom.com: 10
  3. www.androlib.com: 1
  4. www.youtube.com: 1
  5. plusone.google.com: 1
  6. play.google.com/getreview: 10
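
The intervals above could be applied with a small lookup helper. This is only a minimal sketch, not the project's actual code: the names `SLEEP_INTERVALS`, `DEFAULT_INTERVAL`, `sleep_interval_for` and `polite_wait` are hypothetical, the unit is assumed to be seconds, and the fallback value for unlisted hosts is an assumption.

```python
import time
from urllib.parse import urlparse

# Intervals copied from the list above; the unit is assumed to be seconds.
SLEEP_INTERVALS = {
    "play.google.com/getreview": 10,
    "play.google.com": 1,
    "www.androidzoom.com": 10,
    "www.androlib.com": 1,
    "www.youtube.com": 1,
    "plusone.google.com": 1,
}
DEFAULT_INTERVAL = 10  # assumption: be conservative for hosts not listed


def sleep_interval_for(url):
    """Return the interval of the longest listed prefix matching the URL."""
    parsed = urlparse(url)
    target = parsed.netloc + parsed.path
    best_prefix = ""
    for prefix in SLEEP_INTERVALS:
        matches = target == prefix or target.startswith(prefix + "/")
        if matches and len(prefix) > len(best_prefix):
            best_prefix = prefix
    return SLEEP_INTERVALS.get(best_prefix, DEFAULT_INTERVAL)


def polite_wait(url, sleep=time.sleep):
    """Sleep for the interval that applies to the URL and return it."""
    interval = sleep_interval_for(url)
    sleep(interval)
    return interval
```

Calling `polite_wait(url)` before each request keeps the more specific entry (`play.google.com/getreview`) from being shadowed by the general `play.google.com` one, because the longest matching prefix wins.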

Experience with the timeout parameter:

  1. Customer reviews for each app can only be retrieved through AJAX requests to google.com.
  2. Responses from the Google server are often delayed, so from experience it is best to set timeout=10 for httplib.
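
The timeout setting can be sketched as follows. This is an illustrative example rather than the project's own code: it uses `http.client`, the Python 3 name of `httplib`, and the helpers `open_connection` and `fetch` are hypothetical. The timeout value is in seconds.

```python
import http.client
import socket

TIMEOUT_SECONDS = 10  # value suggested above


def open_connection(host, timeout=TIMEOUT_SECONDS):
    """Create an HTTPS connection object; nothing is sent until request()."""
    return http.client.HTTPSConnection(host, timeout=timeout)


def fetch(host, path, timeout=TIMEOUT_SECONDS):
    """Return (status, body), or None if the server does not answer in time."""
    conn = open_connection(host, timeout)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        return response.status, response.read()
    except socket.timeout:
        # The server took longer than `timeout` seconds; the caller may retry.
        return None
    finally:
        conn.close()
```

Returning `None` on a timeout lets the caller decide whether to sleep and retry or to skip the app. Other network errors are not caught here and propagate to the caller.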