# Lab 1: Information Retrieval

__Students:__ Sebastian Callh sebca553, Jacob Lundberg jaclu010

### Crawling
The corpus for this assignment consists of at least 1000 Google Play app descriptions. To acquire them, we crawl every category page for the app URLs it lists, and then fetch each app page for its description. First, we import the packages we need.

In [21]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import urllib.request
import nltk
import pickle
from functools import reduce
from itertools import chain
from nltk.stem.snowball import SnowballStemmer
import re
import os

To scrape the urls we use `urllib` and `re` with the helper functions defined below.

In [22]:
catreg = r'<a class=\"child-submenu-link\" href=\"(/store/apps/category/.*?)\" title=\".*?\" jsl=\"\$x 5;\" jsan=\"7.child-submenu-link,8.href,0.title\">.*?<\/a>'
catre = re.compile(catreg)
appreg = r'href=\"(/store/apps/details.*?)\"'
appre = re.compile(appreg)

play_url = 'https://play.google.com'
def scrape_cat_urls(url):
    mkdwn = urllib.request.urlopen(url).read().decode('utf-8')
    return re.findall(catre, mkdwn)

def scrape_app_urls(cat_url):
    mkdwn = urllib.request.urlopen(play_url + cat_url).read().decode('utf-8')
    return re.findall(appre, mkdwn)
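
The link extraction can be checked offline before touching the network. The following is a minimal sketch that applies the same `appreg` pattern to a hand-written HTML fragment; the fragment and its app ids are made up for illustration and are not real Play Store markup.

```python
import re

# Same pattern as appreg above: capture the href of every app detail link.
appre = re.compile(r'href=\"(/store/apps/details.*?)\"')

# Hypothetical HTML fragment (not real Play Store markup), with one duplicate link.
sample_html = '''
<a href="/store/apps/details?id=com.example.one">One</a>
<a href="/store/apps/category/GAME">Games</a>
<a href="/store/apps/details?id=com.example.two">Two</a>
<a href="/store/apps/details?id=com.example.one">One again</a>
'''

found = appre.findall(sample_html)   # every match, in document order
unique = sorted(set(found))          # a set removes the duplicate link

print(found)
print(unique)
```

Category pages tend to list the same app more than once, which is why the URLs are collected into a set before counting them.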


To fill the quota of 1000 descriptions, some additional links are added.

In [23]:
cat_urls = scrape_cat_urls(play_url + '/store/apps')
cat_urls.append('/store/search?q=poop&c=apps')
cat_urls.append('/store/apps/new')
cat_urls.append('/store/apps/top')
# Flatten the per-category URL lists and deduplicate them.
app_urls = set(chain.from_iterable(scrape_app_urls(url) for url in cat_urls))
len(app_urls)

1177

When scraping all the chosen pages we are well above the requirement, so we can now scrape the actual app descriptions (together with the app names). The descriptions are stored in a pickle file to avoid re-scraping the web pages.

In [6]:
desc = r'itemprop=\"description.*?\">.*?<div jsname=\".*?\">(.*?)</div>'
name = r'itemprop=\"name\" content=\"(.*?)\"\/>'
app_desc_re = re.compile(desc)
app_name_re = re.compile(name)

def scrape_app(app_url):
    mkdwn = urllib.request.urlopen(play_url + app_url + '&hl=en').read().decode('utf-8')
    desc = re.findall(app_desc_re, mkdwn)[0]
    name = re.findall(app_name_re, mkdwn)[0]
    return name, desc

apps = [scrape_app(url) for url in app_urls]
with open('app-descriptions.pkl', 'wb') as f:
    pickle.dump(apps, f)
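
The caching idea can be made explicit with a small load-or-compute helper. This is only a sketch: `load_or_scrape` and the stand-in `fake_scrape` are hypothetical names that do not appear in the notebook, and the real `scrape_app` would take the place of the stand-in.

```python
import os
import pickle
import tempfile

def load_or_scrape(path, urls, scrape_fn):
    """Return cached results from `path` if present, otherwise scrape and cache them."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    results = [scrape_fn(url) for url in urls]
    with open(path, 'wb') as f:
        pickle.dump(results, f)
    return results

# Stand-in for scrape_app that records how often it is called (no network access).
calls = []
def fake_scrape(url):
    calls.append(url)
    return ('name of ' + url, 'description of ' + url)

cache_path = os.path.join(tempfile.mkdtemp(), 'app-descriptions.pkl')
first = load_or_scrape(cache_path, ['/a', '/b'], fake_scrape)   # scrapes and writes the cache
second = load_or_scrape(cache_path, ['/a', '/b'], fake_scrape)  # reads the cache only
print(len(calls), first == second)
```

With a helper like this, re-running the cell does not trigger a new crawl as long as the pickle file exists.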

### Tokenizing and tf-idf representation
With the corpus acquired, we can now create tf-idf representations of the descriptions. To do this we use the `TfidfVectorizer` from `sklearn`, which builds the vectors for us and also lets us specify our own tokenizer function and stop word list.

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer

def load_stopwords():
    nltk_path = './nltk'
    if not os.path.exists(nltk_path):
        os.makedirs(nltk_path)
     
    nltk.download('stopwords', download_dir=nltk_path)

    custom_words =  ['br']
    with open(nltk_path + '/corpora/stopwords/english') as f:
        nltk_words = [x.strip() for x in f.readlines()]
        
    return custom_words + nltk_words
    
def tokenize(d): 
    stemmer = SnowballStemmer("english", ignore_stopwords=True)
    tokens = [s.lower() for s in nltk.word_tokenize(d) if s.isalpha()]
    return [stemmer.stem(w) for w in tokens]


with open('app-descriptions.pkl', 'rb') as f:
    apps = pickle.load(f)
    
stopwords = load_stopwords()
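# Stop words are filtered after tokenize() has run, i.e. on the stemmed tokens.
# ignore_stopwords=True in the stemmer leaves stop words unstemmed, so they still match the list.
transvector = TfidfVectorizer(tokenizer=tokenize, stop_words=stopwords, analyzer="word")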
app_descrs = [app[1] for app in apps]
vocab_matrix = transvector.fit_transform(app_descrs)


[nltk_data] Downloading package stopwords to ./nltk...
[nltk_data]   Package stopwords is already up-to-date!
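
To make the representation less of a black box, here is a minimal sketch of what a tf-idf vector is, using only the standard library on a made-up three-document corpus. It assumes the smoothed idf `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is what `TfidfVectorizer` applies with its default settings; the helper name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, list of L2-normalised tf-idf vectors) for tokenised docs."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # guard against empty docs
        vectors.append([x / norm for x in vec])
    return vocab, vectors

# Toy corpus, already tokenised and stemmed (made up for illustration).
docs = [
    ['dragon', 'hero', 'game'],
    ['race', 'car', 'game'],
    ['dragon', 'dragon', 'race'],
]
vocab, vectors = tfidf_vectors(docs)
for t, w in zip(vocab, vectors[0]):
    print(f'{t:7s} {w:.3f}')
```

In the first toy document, `hero` occurs in only one document and therefore gets a larger weight than `dragon` or `game`, which each occur in two.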


### Query Process
We can finally query the corpus. The query string is mapped into the same vector space by the `transvector` fitted in the previous step, and the apps are ranked by the cosine similarity between the query vector and each description vector.

In [58]:
def n_max(arr, n):
    indices = arr.ravel().argsort()[-n:]
    indices = (np.unravel_index(i, arr.shape) for i in indices)
    return reversed([(arr[i], i[1]) for i in indices])

def query(qstring, k):
    q = transvector.transform([qstring])
    res = cosine_similarity(q, vocab_matrix)
    return n_max(res, k)

def print_query_res(apps, res):
    app_index = res[1]
    app_sim = res[0]
    print("ID:", app_index, " - ", apps[app_index][0], " - Sim:", app_sim)
    print(apps[app_index][1][0:50], "...")
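
The ranking step itself is small. The sketch below uses plain Python lists in place of the sparse matrix to show what `cosine_similarity` followed by `n_max` computes; `cosine` and `top_k` are hypothetical helper names and the vectors are made up.

```python
import math

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k(query_vec, doc_vecs, k):
    """Return the k best (similarity, document index) pairs, most similar first."""
    scored = [(cosine(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    return sorted(scored, reverse=True)[:k]

doc_vecs = [
    [1.0, 0.0, 1.0],   # document 0
    [0.0, 1.0, 0.0],   # document 1: shares no term with the query
    [2.0, 0.0, 2.0],   # document 2: same direction as document 0, twice as long
    [1.0, 1.0, 0.0],   # document 3: shares one of two terms
]
query_vec = [1.0, 0.0, 1.0]

result = top_k(query_vec, doc_vecs, 3)
for sim, idx in result:
    print(idx, round(sim, 3))
```

Documents 0 and 2 score the same up to rounding, because cosine similarity depends only on the direction of the vectors and not on their length.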
    

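Querying "Dragon, Control, hero, running" and "The hero controls the dragon to run." gives exactly the same results. This is what we expect if tokenization works correctly: after lowercasing, dropping punctuation, stemming and removing stop words, both queries reduce to the same terms (dragon, control, hero, run) and therefore to the same query vector.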

In [61]:
for qres in query("Dragon, Control, hero, running", 5):
    print_query_res(apps, qres)

print("---------------------------------------")

for qres in query("The hero controls the dragon to run.", 5):
    print_query_res(apps, qres)

ID: 76  -  School of Dragons  - Sim: 0.4469180079438126
Join Hiccup, Toothless, Astrid and the rest of the ...
ID: 1246  -  Merge Dragons!  - Sim: 0.3653360600610415
Discover dragon legends, magic, quests, and a secr ...
ID: 872  -  Dragon Mania Legends  - Sim: 0.34055021729744117
&quot;Dragon Mania Legends is for anyone that want ...
ID: 566  -  Heroes of Warland - PvP Shooter Arena  - Sim: 0.3180317269817844
Heroes of Warland is the most competitive online P ...
ID: 988  -  Battle Arena: Heroes Adventure - Online RPG  - Sim: 0.3020582940150625
Battle Arena: Heroes Adventure is an incredible mi ...
---------------------------------------
ID: 76  -  School of Dragons  - Sim: 0.4469180079438126
Join Hiccup, Toothless, Astrid and the rest of the ...
ID: 1246  -  Merge Dragons!  - Sim: 0.3653360600610415
Discover dragon legends, magic, quests, and a secr ...
ID: 872  -  Dragon Mania Legends  - Sim: 0.34055021729744117
&quot;Dragon Mania Legends is for anyone that want ...
ID: 566 
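
The reason both queries land on the same vector can be illustrated without NLTK. The sketch below is a deliberately crude stand-in for the notebook's pipeline: `toy_stem` strips a few suffixes by hand and is not the Snowball algorithm, and `TOY_STOPWORDS` is a tiny made-up list. It only serves to show that lowercasing, dropping non-alphabetic characters, removing stop words and stemming reduce both queries to the same bag of terms.

```python
import re
from collections import Counter

TOY_STOPWORDS = {'the', 'to', 'a', 'an'}   # tiny made-up list, not NLTK's

def toy_stem(word):
    """Crude suffix stripping for illustration only (NOT the Snowball stemmer)."""
    if word.endswith('ning') and len(word) > 5:
        return word[:-4]            # running -> run
    if word.endswith('s') and not word.endswith('ss'):
        return word[:-1]            # controls -> control
    return word

def toy_tokenize(text):
    words = re.findall(r'[a-z]+', text.lower())   # lowercase, letters only
    return [toy_stem(w) for w in words if w not in TOY_STOPWORDS]

q1 = toy_tokenize('Dragon, Control, hero, running')
q2 = toy_tokenize('The hero controls the dragon to run.')
print(sorted(q1))
print(sorted(q2))
print(Counter(q1) == Counter(q2))
```

Because tf-idf ignores word order, two queries with equal term counts are mapped to identical vectors and hence produce identical rankings.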

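Querying with the full description of an app returns that app with a similarity of 1.0, which is also to be expected, since the query vector is then identical to the app's own document vector. The printed value of 1.0000000000000002 is floating-point rounding.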

In [80]:
app = apps[qres[1]]
app_desc = app[1]
print("querying for ", app[0])
res = query(app_desc, 5)
for r in res:
    print_query_res(apps, r)

querying for  Traffic Racing - Extreme
ID: 1082  -  Traffic Racing - Extreme  - Sim: 1.0000000000000002
Traffic Racing-Extreme of car racing offline games ...
ID: 444  -  Racing in Highway Car 2018: City Traffic Top Racer  - Sim: 0.802217211159529
Drive car in highway traffic really challenging fo ...
ID: 1105  -  Racing in Car 2  - Sim: 0.5473163896068399
Sick of endless racing games with third person per ...
ID: 666  -  Extreme Car Driving Simulator  - Sim: 0.49082883018284174
Extreme Car Driving Simulator is the best car simu ...
ID: 462  -  Idle Racing GO: Car Clicker &amp; Driving Simulator  - Sim: 0.48492239342029814
Tap as fast as you can, collect cash and special c ...
