# Lab 1: Information Retrieval

__Students:__ Sebastian Callh sebca553, Jacob Lundberg jaclu010

### Crawling
The corpus for this assignment consists of at least 1000 Google Play app descriptions. To acquire them, we crawl every category page for the app URLs it lists, and then crawl each app URL for its description. First, let's import the packages we use.

In [2]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import urllib.request
import nltk
import pickle
from functools import reduce
from itertools import chain
from nltk.stem.snowball import SnowballStemmer
import re
import os

To scrape the urls we use `urllib` and `re` with the helper functions defined below.

In [3]:
catreg = r'<a class=\"child-submenu-link\" href=\"(/store/apps/category/.*?)\" title=\".*?\" jsl=\"\$x 5;\" jsan=\"7.child-submenu-link,8.href,0.title\">.*?<\/a>'
catre = re.compile(catreg)
appreg = r'href=\"(/store/apps/details.*?)\"'
appre = re.compile(appreg)

play_url = 'https://play.google.com'
def scrape_cat_urls(url):
    mkdwn = urllib.request.urlopen(url).read().decode('utf-8')
    return re.findall(catre, mkdwn)

def scrape_app_urls(cat_url):
    mkdwn = urllib.request.urlopen(play_url + cat_url).read().decode('utf-8')
    return re.findall(appre, mkdwn)
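
To see what the app-link pattern extracts without touching the network, here is a minimal sketch that runs the same regular expression as `appreg` on a small HTML fragment. The fragment and the app ids in it are made up for illustration; the real Play Store markup may differ.

```python
import re

# Same pattern as `appreg` above.
app_link_re = re.compile(r'href=\"(/store/apps/details.*?)\"')

# Hypothetical page fragment, invented for this example.
sample_html = (
    '<a href="/store/apps/details?id=com.example.one">One</a>'
    '<a href="/store/apps/details?id=com.example.two">Two</a>'
    '<a href="/store/apps/details?id=com.example.one">One again</a>'
    '<a href="/store/apps/category/GAME">Games</a>'
)

# findall returns the captured group for every match, duplicates included.
links = app_link_re.findall(sample_html)
# A set removes apps that are linked more than once.
unique_links = set(links)
print(len(links), len(unique_links))
```

The category link is not matched because its path does not start with `/store/apps/details`, and the repeated app link shows why collecting the results in a set is useful.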


To fill the quota of 1000 descriptions, some additional links are added.

In [4]:
cat_urls = scrape_cat_urls(play_url + '/store/apps')
cat_urls.append('/store/search?q=poop&c=apps')
cat_urls.append('/store/apps/new')
cat_urls.append('/store/apps/top')
app_urls = set(reduce(lambda lst, url: lst + scrape_app_urls(url), cat_urls, []))
len(app_urls)

1259
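
The `reduce` call above builds a new list for every category page. As a possible alternative, `itertools.chain.from_iterable` flattens the per-category results lazily before they are deduplicated. The sketch below uses a hypothetical stand-in, `fake_scrape_app_urls`, with made-up data so that it runs offline.

```python
from itertools import chain

def fake_scrape_app_urls(cat_url):
    # Hypothetical stand-in for scrape_app_urls; the data is invented.
    pages = {
        '/store/apps/category/GAME': ['/store/apps/details?id=a',
                                      '/store/apps/details?id=b'],
        '/store/apps/top': ['/store/apps/details?id=b',
                            '/store/apps/details?id=c'],
    }
    return pages[cat_url]

cat_urls = ['/store/apps/category/GAME', '/store/apps/top']

# Flatten all per-category lists and drop duplicates in one pass.
app_urls = set(chain.from_iterable(fake_scrape_app_urls(u) for u in cat_urls))
print(len(app_urls))
```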

When scraping all the chosen pages we are well above the requirement, so we can now scrape the actual app descriptions (together with the app names). The descriptions are stored in a pickle file to avoid re-scraping the web pages.

In [5]:
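# Uncomment to load previously scraped descriptions instead of re-scraping.
# Note: pickle.load raises "EOFError: Ran out of input" if the file is empty.
#with open('app-descriptions.pkl', 'rb') as f:
#    apps = pickle.load(f)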

In [6]:
desc = r'itemprop=\"description.*?\">.*?<div jsname=\".*?\">(.*?)</div>'
name = r'itemprop=\"name\" content=\"(.*?)\"\/>'
app_desc_re = re.compile(desc)
app_name_re = re.compile(name)

def scrape_app(app_url):
    mkdwn = urllib.request.urlopen(play_url + app_url + '&hl=en').read().decode('utf-8')
    desc = re.findall(app_desc_re, mkdwn)[0]
    name = re.findall(app_name_re, mkdwn)[0]
    return name, desc

apps = [scrape_app(url) for url in app_urls]
with open('app-descriptions.pkl', 'wb') as f:
    pickle.dump(apps, f)
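
The caching idea can be made explicit with a small load-or-build helper. This is only a sketch: `load_or_build` and `fake_scrape_all` are hypothetical names that are not part of the lab code, and the helper treats an empty file as a missing cache because `pickle.load` raises `EOFError` on empty input.

```python
import os
import pickle
import tempfile

def load_or_build(path, build):
    """Load a pickled object from path, or call build() and cache its result."""
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = build()
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result

calls = []

def fake_scrape_all():
    # Hypothetical stand-in for the scraping loop; records each invocation.
    calls.append(1)
    return [('Example App', 'An example description.')]

cache_path = os.path.join(tempfile.mkdtemp(), 'app-descriptions.pkl')
first = load_or_build(cache_path, fake_scrape_all)   # builds and writes the cache
second = load_or_build(cache_path, fake_scrape_all)  # reads the cache
print(len(calls))
```

The second call returns the cached data without invoking the scraper again.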

### Tokenizing and tfidf representation
With the corpus acquired, we can now create tf-idf representations of the descriptions. To do this we use the `TfidfVectorizer` from `sklearn`, which constructs the vectors and also lets us specify our own tokenizer function and stop word list.

In [7]:
def load_stopwords():

    nltk_path = './nltk'
    if not os.path.exists(nltk_path):
        os.makedirs(nltk_path)
     
    nltk.download('stopwords', download_dir=nltk_path)

    custom_words =  ['br']
    with open(nltk_path + '/corpora/stopwords/english') as f:
        nltk_words = [x.strip() for x in f.readlines()]
        
    return custom_words + nltk_words
    
stopwords = load_stopwords()

[nltk_data] Downloading package stopwords to ./nltk...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [11]:
def tokenize(d): 
    stemmer = SnowballStemmer("english", ignore_stopwords=True)
    tokens = [s.lower() for s in nltk.word_tokenize(d) if s.isalpha()]
    return [stemmer.stem(w) for w in tokens]
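
The scraped descriptions are raw HTML fragments: the query output further down shows entities such as `&quot;`, and the custom stop word `br` presumably exists because of `<br>` tags (an assumption). One option, sketched below with a hypothetical helper `clean_description` that is not part of the lab code, is to strip tags and decode entities before tokenizing.

```python
import html
import re

TAG_RE = re.compile(r'<[^>]+>')

def clean_description(raw):
    """Strip HTML tags, then decode entities (hypothetical helper)."""
    # Tags are removed first so that decoded text like "<" is not mistaken for markup.
    no_tags = TAG_RE.sub(' ', raw)
    return html.unescape(no_tags).strip()

# Made-up description fragment for illustration.
raw = '&quot;Dragon Mania&quot; is fun.<br>Train dragons!'
cleaned = clean_description(raw)
print(cleaned)
```

With this kind of cleaning in place, the `br` stop word would likely no longer be needed.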

### Construct Inverted file index (Vector Model)



d) Preprocess the text using NLP techniques from the __[nltk module](http://www.nltk.org/py-modindex.html)__ or spaCy.

Use `nltk.download(ID)` to fetch a corpus that has not been downloaded yet; see the list of __[nltk corpora](http://www.nltk.org/nltk_data/)__.

...) Compute tf-idf, 
e.g. using functions from the __[scikit-learn module](http://scikit-learn.org/stable/modules/classes.html)__. `TfidfVectorizer` converts a collection of raw documents to a matrix of TF-IDF features.
#### You can also build the tf-idf matrix with another library or your own algorithm

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Stop words are filtered after tokenization, so they must match the tokenizer's output.
# The stemmer in tokenize() uses ignore_stopwords=True, which leaves stop words unstemmed.
transvector = TfidfVectorizer(tokenizer=tokenize, stop_words=stopwords, analyzer="word")
app_descrs = [app[1] for app in apps]
vocab_matrix = transvector.fit_transform(app_descrs)



# Note: non-English words are not stemmed correctly by the English stemmer.
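
To make the weighting concrete, here is a minimal pure-Python sketch of tf-idf on a toy corpus of already tokenized documents. It assumes the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation, which to our understanding matches the scikit-learn defaults; the function `tfidf` and the toy documents are made up for illustration and are not how `TfidfVectorizer` is implemented internally.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns (vocabulary, L2-normalised tf-idf vectors)."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for d in docs for t in set(d))
    # Smoothed idf (assumed to match scikit-learn's default settings).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec))
        vectors.append([x / norm for x in vec] if norm else vec)
    return vocab, vectors

# Toy corpus of stemmed tokens, invented for this example.
docs = [
    ['dragon', 'hero', 'run'],
    ['dragon', 'farm'],
    ['hero', 'battl', 'arena'],
]
vocab, vectors = tfidf(docs)
for token, weight in zip(vocab, vectors[0]):
    print(token, round(weight, 3))
```

In the first document, `run` gets a higher weight than `dragon` because it occurs in fewer documents of the corpus.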

### Query Process

e.g. "Dragon, Control, hero, running"

e.g. "The hero controls the dragon to run."



In [13]:
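def n_max(arr, n):
    """Return the n largest values of arr as (value, column index) pairs, largest first."""
    indices = arr.ravel().argsort()[-n:]
    indices = (np.unravel_index(i, arr.shape) for i in indices)
    # Return a list rather than a one-shot iterator so the result can be reused.
    return [(arr[i], i[1]) for i in indices][::-1]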

In [14]:
def query(qstring, k):
    q = transvector.transform([qstring])
    
    res = cosine_similarity(q, vocab_matrix)
    
    top_k = n_max(res, k)

    return top_k
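
The ranking step can be illustrated without `sklearn` or `numpy`. The sketch below computes cosine similarity between a query vector and a few document vectors and keeps the best `k` with `heapq.nlargest`. The helper names `cosine` and `top_k_docs` and the vectors are hypothetical and chosen only for illustration.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_docs(query_vec, doc_vecs, k):
    """Return the k best (similarity, document index) pairs, best first."""
    scored = ((cosine(query_vec, d), i) for i, d in enumerate(doc_vecs))
    return heapq.nlargest(k, scored)

# Made-up three-term vectors for three documents and one query.
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.0, 0.0]

result = top_k_docs(query_vec, doc_vecs, 2)
for score, doc_id in result:
    print(doc_id, round(score, 3))
```

Document 0 points in the same direction as the query and ranks first, document 1 shares only one term and ranks second, and document 2 shares no terms and is left out.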

In [24]:
qres = query("The hero controls the dragon to run.", 5)

In [25]:
for row in qres:
    print("ID:", row[1], " - ", apps[row[1]][0], " - ", row[0], ". Desc:", apps[row[1]][1][0:50], "...")

ID: 76  -  School of Dragons  -  0.4469180079438126 . Desc: Join Hiccup, Toothless, Astrid and the rest of the ...
ID: 1246  -  Merge Dragons!  -  0.3653360600610415 . Desc: Discover dragon legends, magic, quests, and a secr ...
ID: 872  -  Dragon Mania Legends  -  0.34055021729744117 . Desc: &quot;Dragon Mania Legends is for anyone that want ...
ID: 566  -  Heroes of Warland - PvP Shooter Arena  -  0.3180317269817844 . Desc: Heroes of Warland is the most competitive online P ...
ID: 988  -  Battle Arena: Heroes Adventure - Online RPG  -  0.3020582940150625 . Desc: Battle Arena: Heroes Adventure is an incredible mi ...
