# Lab 1: Information Retrieval

__Students:__ Sebastian Callh sebca553, Jacob Lundberg jaclu010

### Crawling



The corpus for this assignment consists of at least 1000 Google Play app descriptions. To acquire them, we crawl every category page for the app URLs it lists, and then crawl each app URL for its description. First, let's import the packages we use.

In [2]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import urllib.request
import nltk
import pickle
from functools import reduce
from itertools import chain
from nltk.stem.snowball import SnowballStemmer
import re

To scrape the URLs we use `urllib` and `re` with the helper functions defined below.

In [3]:
catreg = r'<a class=\"child-submenu-link\" href=\"(/store/apps/category/.*?)\" title=\".*?\" jsl=\"\$x 5;\" jsan=\"7.child-submenu-link,8.href,0.title\">.*?<\/a>'
catre = re.compile(catreg)
appreg = r'href=\"(/store/apps/details.*?)\"'
appre = re.compile(appreg)

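play_url = 'https://play.google.com'

def scrape_cat_urls(url):
    # Fetch a store page and return the category paths linked from its submenu.
    mkdwn = urllib.request.urlopen(url).read().decode('utf-8')
    return catre.findall(mkdwn)

def scrape_app_urls(cat_url):
    # Fetch a listing page (path relative to play_url) and return all app detail paths.
    mkdwn = urllib.request.urlopen(play_url + cat_url).read().decode('utf-8')
    return appre.findall(mkdwn)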
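
To see what the app link pattern captures, here is a minimal sketch that runs it over a hand-written HTML fragment. The fragment and the package ids in it are made up for illustration; real Play Store markup is more involved and may differ from what the crawl saw.

```python
import re

# Same pattern as appreg, written without the redundant quote escapes.
app_link_re = re.compile(r'href="(/store/apps/details.*?)"')

# Hypothetical HTML fragment; the package ids are invented.
sample_html = (
    '<a href="/store/apps/details?id=com.example.one">One</a>'
    '<a href="/store/apps/details?id=com.example.two">Two</a>'
    '<a href="/store/apps/details?id=com.example.one">One again</a>'
    '<a href="/store/apps/category/GAME">Games</a>'
)

links = app_link_re.findall(sample_html)
unique_links = set(links)  # the same app can be linked more than once per page

print(links)
print(len(unique_links))
```

The category link is ignored because it does not start with `/store/apps/details`, and the repeated app link is why the crawl collects its results in a `set`.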


To fill the quota of 1000 descriptions, some additional links are added.

In [4]:
cat_urls = scrape_cat_urls(play_url + '/store/apps')
cat_urls.append('/store/search?q=poop&c=apps')
cat_urls.append('/store/apps/new')
cat_urls.append('/store/apps/top')
app_urls = set(reduce(lambda lst, url: lst + scrape_app_urls(url), cat_urls, []))
len(app_urls)

1398

Scraping the chosen pages yields 1398 unique app URLs, well above the requirement, so we can now scrape the actual app descriptions together with the app names. The descriptions are stored in a pickle file to avoid re-scraping the web pages.

In [None]:
desc = r'itemprop=\"description.*?\">.*?<div jsname=\".*?\">(.*?)</div>'
name = r'itemprop=\"name\" content=\"(.*?)\"\/>'
app_desc_re = re.compile(desc)
app_name_re = re.compile(name)

def scrape_app(app_url):
    mkdwn = urllib.request.urlopen(play_url + app_url + '&hl=en').read().decode('utf-8')
    desc = re.findall(app_desc_re, mkdwn)[0]
    name = re.findall(app_name_re, mkdwn)[0]
    return name, desc

apps = [scrape_app(url) for url in app_urls]
with open('app-descriptions.pkl', 'wb') as f:
    pickle.dump(apps, f)
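
The cell above only writes the pickle. A later session can reload the corpus instead of re-scraping. The sketch below shows the round trip on a small stand-in list written to a temporary directory; the app names and descriptions are placeholders, not scraped data.

```python
import os
import pickle
import tempfile

# Placeholder (name, description) tuples standing in for the scraped apps.
sample_apps = [
    ("Example App", "A short description."),
    ("Another App", "Another description."),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "app-descriptions.pkl")

    # Write the list once ...
    with open(path, "wb") as f:
        pickle.dump(sample_apps, f)

    # ... and read it back in place of re-running the scraper.
    with open(path, "rb") as f:
        loaded_apps = pickle.load(f)

print(loaded_apps == sample_apps)
```

In the notebook this would amount to calling `pickle.load` on `app-descriptions.pkl` and assigning the result to `apps`. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.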

### Tokenizing and tfidf representation
With the corpus acquired, we can now create tf-idf representations of the descriptions. To do this we use the `TfidfVectorizer` from `sklearn`, which constructs the vectors and also lets us specify our own tokenizer function and stop word list.

In [192]:
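def download_stopwords():
    # Fetch the NLTK resources used below (no-op if already present).
    nltk.download('stopwords')
    nltk.download('punkt')

    # Load the English stop word list through the NLTK corpus reader
    # instead of a hard-coded path into nltk_data.
    stopwords = nltk.corpus.stopwords.words('english')
    # 'br' survives tokenization of the <br> tags in the scraped HTML.
    stopwords.append('br')
    return stopwords

stopwords = download_stopwords()
stemmer = SnowballStemmer("english", ignore_stopwords=True)  # stop words are left unstemmed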

[nltk_data] Downloading package stopwords to /home/jacke/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/jacke/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [194]:
def tokenize(d):
    # Keep purely alphabetic tokens, lowercased.
    tokens = [s.lower() for s in nltk.word_tokenize(d) if s.isalpha()]
    # Drop stop words first, then stem what remains.
    return [stemmer.stem(w) for w in tokens if w not in stopwords]

# Names and descriptions in corpus order; a row index of the tf-idf matrix maps to both lists.
names = [name for name, _ in apps]
docs = [desc for _, desc in apps]
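
The `'br'` stop word is needed because the scraped descriptions still contain HTML tags such as `<br>` and `<b>`, and entities such as `&amp;`. As an alternative, the markup could be removed before tokenizing. The following is a rough sketch using only the standard library; `strip_markup` is a hypothetical helper that is not part of the lab code, and a regex is a crude way to drop tags that is only adequate for simple inline markup.

```python
import html
import re

TAG_RE = re.compile(r'<[^>]+>')

def strip_markup(text):
    """Hypothetical helper: replace tags with spaces, then decode entities."""
    without_tags = TAG_RE.sub(' ', text)
    decoded = html.unescape(without_tags)
    # Collapse the whitespace left behind by removed tags.
    return ' '.join(decoded.split())

raw = '<b>Live Wallpapers HD &amp; Backgrounds</b><br>Tired of current wallpaper?'
print(strip_markup(raw))
```

Applied to each description before `tokenize`, a step like this would likely make the extra `'br'` entry unnecessary, though that has not been checked against the full corpus.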

### Construct Inverted file index (Vector Model)



d) Preprocess text using NLP techniques from the __[nltk module](http://www.nltk.org/py-modindex.html)__ or spaCy.

Use `nltk.download(ID)` to fetch a corpus that has not been downloaded yet; see __[nltk corpora](http://www.nltk.org/nltk_data/)__.

In [None]:
# import nltk
#nltk.download('stopwords')
#nltk.download('punkt')

...) Compute tf-idf,
e.g. using functions from the __[scikit-learn module](http://scikit-learn.org/stable/modules/classes.html)__. `TfidfVectorizer` converts a collection of raw documents to a matrix of TF-IDF features.
#### You can also build the tf-idf matrix with another library or your own algorithm

In [195]:
from sklearn.feature_extraction.text import TfidfVectorizer

# tokenize() removes stop words before stemming; stop_words additionally
# filters the stemmed tokens that the vectorizer receives from tokenize().
transvector = TfidfVectorizer(tokenizer=tokenize, stop_words=stopwords)
app_descrs = [app[1] for app in apps]
X = transvector.fit_transform(app_descrs)

# Dense view of the document-term matrix and the learned vocabulary.
print(X.toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there.
print(transvector.get_feature_names())

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
['aa', 'aac', 'aard', 'aarhus', 'aaron', 'aastock', 'aat', 'ab', 'abandon', 'abbi', 'abbott', 'abbrevi', 'abc', 'abcd', 'abcmous', 'abcsong', 'abdomin', 'abf', 'abgefragt', 'abgeschaut', 'abi', 'abid', 'abigail', 'abil', 'abjad', 'abl', 'ablaz', 'ablösesummen', 'abnorm', 'aboard', 'abomin', 'abonn', 'abonnemang', 'abonnemanget', 'abonniert', 'abono', 'abrindo', 'abrir', 'abroad', 'abruf', 'abrufen', 'absenc', 'absolut', 'absolutley', 'absoluto', 'absorb', 'abspielbar', 'abspielen', 'abstand', 'abstract', 'abu', 'abund', 'abyss', 'ac', 'acaan', 'academ', 'academi', 'acalmar', 'acc', 'acced', 'acceder', 'accediendo', 'acceler', 'acceleromet', 'accent', 'accept', 'accepté', 'acceso', 'access', 'accessor', 'accessori', 'accid', 'accident', 'acclaim', 'accommod', 'accompani', 'accomplish', 'accord', 'accordion', 'account', 'accueil', 'ac
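
To make the numbers in `X` less opaque, the following sketch computes tf-idf weights by hand for a three-document toy corpus. It assumes the smoothed idf `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which is how the scikit-learn documentation describes the `TfidfVectorizer` defaults. It is an illustration, not the library's implementation, and the toy documents are invented.

```python
import math
from collections import Counter

# Invented, already-tokenized toy documents.
toy_docs = [
    ["anim", "wallpap", "wallpap"],
    ["live", "wallpap"],
    ["dragon", "hero"],
]

vocab = sorted({term for doc in toy_docs for term in doc})
n_docs = len(toy_docs)

# Document frequency: in how many documents each term occurs.
df = Counter(term for doc in toy_docs for term in set(doc))

# Smoothed idf (assumed to match the documented scikit-learn default).
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(doc):
    """Return the L2-normalized tf-idf vector of one tokenized document."""
    counts = Counter(doc)
    row = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in row))
    return [w / norm for w in row] if norm else row

toy_X = [tfidf_row(doc) for doc in toy_docs]

for doc, row in zip(toy_docs, toy_X):
    print(doc, [round(w, 3) for w in row])
```

In the second toy document the rarer term `live` ends up with a larger weight than `wallpap`, which occurs in two of the three documents. That is the effect the idf factor is meant to have.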

### Query Process

e.g. "Dragon, Control, hero, running"

e.g. "The hero controls the dragon to run."



In [196]:
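def n_max(arr, n):
    # Flat indices of the n largest values, in ascending order of value.
    indices = arr.ravel().argsort()[-n:]
    # Convert flat indices back to (row, column) positions in arr.
    indices = (np.unravel_index(i, arr.shape) for i in indices)
    # Best match first; each entry is (similarity, document index).
    return [(arr[i], i[1]) for i in indices][::-1]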

In [197]:
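def query(qstring, k):
    # Project the query into the same tf-idf space as the corpus.
    q = transvector.transform([qstring])
    # Similarity of the query to every description, shape (1, n_docs).
    res = cosine_similarity(q, X)
    # k best (similarity, document index) pairs, best first.
    return n_max(res, k)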
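
As a sanity check of what `query` computes, here is a small standard-library sketch of the same idea: score every document vector against a query vector by cosine similarity and keep the `k` best `(score, index)` pairs. The vectors are invented, and `top_k` is a hypothetical stand-in for the combination of `cosine_similarity` and `n_max`.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k(query_vec, doc_vecs, k):
    """Return the k highest (score, doc_index) pairs, best first."""
    scored = [(cosine(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    return heapq.nlargest(k, scored)

# Invented 3-term document vectors.
doc_vecs = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 1.0, 0.0]

for score, idx in top_k(query_vec, doc_vecs, 2):
    print(idx, round(score, 3))
```

The document that shares both query terms ranks first, the one that shares a single term second, and the document with no overlap scores zero and is dropped from the top two.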

In [203]:
qres = query("Top Anime Wallpaper", 5)

In [204]:
for row in qres:
    print("ID:", row[1], " - ", names[row[1]], " - ", row[0], ". Desc:", docs[row[1]][0:50], "...")

ID: 899  -  Top Anime Wallpaper  -  0.8633632511232048 . Desc: if you love Anime you will find in this app Top An ...
ID: 1317  -  Live Wallpapers HD &amp; Backgrounds 4k/3D - WALLOOP™  -  0.5527035968831979 . Desc: <b>Walloop is the first collection of Live Wallpap ...
ID: 273  -  Electric Screen for Prank Live Wallpaper  -  0.5080523359716219 . Desc: <b>Tired of current wallpaper?😆 Now this Electric  ...
ID: 441  -  4K Wallpapers - Auto Wallpaper Changer  -  0.47851780341050854 . Desc: <b>4K Wallpapers (4K Backgrounds) - Live Wallpaper ...
ID: 519  -  Wallpapers HD, 4K Backgrounds  -  0.47687611272814234 . Desc: — Tailored Wallpapers for your device.<br>Attached ...
