# Lab 1: Information Retrieval

__Students:__ Sebastian Callh sebca553, Jacob Lundberg jaclu010

### Crawling
The corpus for this assignment consists of at least 1000 Google Play app descriptions. To acquire them, we crawl every category page for the app URLs it lists, and then crawl each of those app URLs for its description. First, let's import the packages we use.

In [8]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import urllib.request
import nltk
import pickle
from functools import reduce
from itertools import chain
from nltk.stem.snowball import SnowballStemmer
import re
import os

To scrape the URLs we use `urllib` and `re`, with the helper functions defined below.

In [3]:
catreg = r'<a class=\"child-submenu-link\" href=\"(/store/apps/category/.*?)\" title=\".*?\" jsl=\"\$x 5;\" jsan=\"7.child-submenu-link,8.href,0.title\">.*?<\/a>'
catre = re.compile(catreg)
appreg = r'href=\"(/store/apps/details.*?)\"'
appre = re.compile(appreg)

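play_url = 'https://play.google.com'
def scrape_cat_urls(url):
    """Return the relative category URLs linked from the page at `url`."""
    html_page = urllib.request.urlopen(url).read().decode('utf-8')
    return catre.findall(html_page)

def scrape_app_urls(cat_url):
    """Return the relative app URLs linked from the category page `cat_url`."""
    html_page = urllib.request.urlopen(play_url + cat_url).read().decode('utf-8')
    return appre.findall(html_page)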
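
The extraction step can be tried offline. The sketch below applies the same app pattern as `appreg` to a small HTML fragment; the fragment and the `com.example.*` ids are made up for illustration and are not taken from Google Play.

```python
import re

# Same pattern as `appreg` in the cell above.
app_re = re.compile(r'href=\"(/store/apps/details.*?)\"')

# Hypothetical HTML fragment, invented for this example.
sample_html = (
    '<a href="/store/apps/details?id=com.example.one">One</a>'
    '<a href="/store/apps/details?id=com.example.two">Two</a>'
    '<a href="/store/apps/details?id=com.example.one">One again</a>'
    '<a href="/store/apps/category/GAME">Games</a>'
)

found = app_re.findall(sample_html)   # every match, duplicates included
unique = sorted(set(found))           # duplicates removed, as the set() in the crawl does

print(found)
print(unique)
```

The category link is ignored because it does not start with `/store/apps/details`, and the repeated app link collapses to one entry once the matches are put in a set.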


To reach the quota of 1000 descriptions, we add a few extra listing pages to the category URLs.

In [4]:
cat_urls = scrape_cat_urls(play_url + '/store/apps')
cat_urls.append('/store/search?q=poop&c=apps')
cat_urls.append('/store/apps/new')
cat_urls.append('/store/apps/top')
app_urls = set(chain.from_iterable(scrape_app_urls(url) for url in cat_urls))
len(app_urls)

1398

Scraping the chosen pages yields 1398 unique app URLs, which is well above the requirement, so we can now scrape the actual app descriptions (together with the app names). The descriptions are stored in a pickle file to avoid re-scraping the web pages.

In [5]:
desc = r'itemprop=\"description.*?\">.*?<div jsname=\".*?\">(.*?)</div>'
name = r'itemprop=\"name\" content=\"(.*?)\"\/>'
app_desc_re = re.compile(desc)
app_name_re = re.compile(name)

def scrape_app(app_url):
    mkdwn = urllib.request.urlopen(play_url + app_url + '&hl=en').read().decode('utf-8')
    desc = re.findall(app_desc_re, mkdwn)[0]
    name = re.findall(app_name_re, mkdwn)[0]
    return name, desc

apps = [scrape_app(url) for url in app_urls]
with open('app-descriptions.pkl', 'wb') as f:
    pickle.dump(apps, f)

OSError: [Errno 113] No route to host
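
Since the scrape can fail on network errors like the one above, it helps to read the pickle whenever it already exists. A minimal sketch of that load-or-build pattern follows; `load_or_build` is a hypothetical helper name, not part of the original notebook.

```python
import os
import pickle

def load_or_build(path, build):
    """Return the object pickled at `path`; if the file is missing, call
    `build()`, pickle its result to `path`, and return it."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = build()
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result
```

Under these assumptions the scraping cell could be written as `apps = load_or_build('app-descriptions.pkl', lambda: [scrape_app(url) for url in app_urls])`, so a re-run without network access still works once the file exists.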

### Tokenizing and tfidf representation
With the corpus acquired, we can now create tf-idf representations of the descriptions. To do this we use the `TfidfVectorizer` from `sklearn`, which constructs the vectors as needed and also allows us to specify our own tokenizer function and stopword list.
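
To make the weighting concrete, here is a pure-Python sketch of tf-idf on already tokenized documents. It assumes the default `TfidfVectorizer` settings (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 normalisation); the function name `tfidf_vectors` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

# Toy corpus of stemmed tokens (hypothetical).
docs = [['dragon', 'hero', 'run'], ['hero', 'puzzl'], ['sleep', 'relax', 'sleep']]
vecs = tfidf_vectors(docs)
print(vecs[0])
```

In the first document `dragon` gets a higher weight than `hero`, because `hero` also occurs in the second document and therefore has a lower idf.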

In [20]:
def load_stopwords():
    """Download the NLTK English stopword list and extend it with custom words."""
    nltk_path = './nltk'
    os.makedirs(nltk_path, exist_ok=True)
    nltk.download('stopwords', download_dir=nltk_path)

    # 'br' is left over from the <br> tags in the scraped descriptions.
    custom_words = ['br']
    with open(nltk_path + '/corpora/stopwords/english') as f:
        nltk_words = [x.strip() for x in f.readlines()]

    return custom_words + nltk_words

stopwords = load_stopwords()


[nltk_data] Downloading package stopwords to ./nltk...
[nltk_data]   Package stopwords is already up-to-date!


['br',
 'i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',

In [237]:
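# Create the stemmer once instead of on every call to tokenize().
stemmer = SnowballStemmer("english", ignore_stopwords=True)

def tokenize(d):
    """Lowercase, keep alphabetic tokens, drop stopwords, and stem the rest."""
    tokens = [s.lower() for s in nltk.word_tokenize(d) if s.isalpha()]
    return [stemmer.stem(w) for w in tokens if w not in stopwords]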

#docs = [re.findall(appre, d)[0] for d in descriptions]
#names = [re.findall(appname, d)[0] for d in descriptions]
#tokens = reduce(lambda lst, d: lst + [process(d)], docs, [])
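
The scraped descriptions still contain `<br>` tags and HTML entities such as `&amp;` and `&#39;`. As an alternative to filtering `br` as a stopword, the markup could be removed before tokenizing. The sketch below is one possible way to do that with the standard library; `clean_description` is a hypothetical helper, not part of the original notebook.

```python
import html
import re

def clean_description(raw):
    """Replace <br> tags with spaces, decode HTML entities, collapse whitespace."""
    text = re.sub(r'<br\s*/?>', ' ', raw, flags=re.IGNORECASE)
    text = html.unescape(text)            # &amp; -> &, &#39; -> '
    return re.sub(r'\s+', ' ', text).strip()

cleaned = clean_description("Tap &amp; Color.<br><br>Get all of Soy Luna&#39;s music")
print(cleaned)
```

It could be applied as `tokenize(clean_description(d))`, or to each description right after scraping.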

### Construct Inverted file index (Vector Model)



d) Preprocess the text using NLP techniques from the __[nltk module](http://www.nltk.org/py-modindex.html)__ or spaCy.

Use `nltk.download(ID)` to fetch a corpus that has not been downloaded yet. See __[nltk corpora](http://www.nltk.org/nltk_data/)__ for the available IDs.

In [None]:
# import nltk
#nltk.download('stopwords')
#nltk.download('punkt')

Compute tf-idf, e.g. using functions from the __[scikit-learn module](http://scikit-learn.org/stable/modules/classes.html)__. `TfidfVectorizer` converts a collection of raw documents to a matrix of TF-IDF features.
#### You can also build the tf-idf matrix with another library or your own algorithm

In [238]:
from sklearn.feature_extraction.text import TfidfVectorizer
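# tokenize() already drops stopwords before stemming. stop_words is passed as well,
# but sklearn applies it to the tokens returned by tokenize(), i.e. to stemmed words.
transvector = TfidfVectorizer(tokenizer=tokenize, stop_words=stopwords, analyzer="word")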
app_descrs = [app[1] for app in apps]
vocab_matrix = transvector.fit_transform(app_descrs)



# Note: non-English words are not stemmed correctly by the English stemmer.








#[
#     'This is the first document.',
#     'This is the second second document.',
#     'And this is the third one.',
#     'Is this the first document?',]
#print(X.toarray())
#print(transvector.get_feature_names())

#def tfidf(corpus):
#    return transvector.fit_transform(corpus)

# idfs.item.toarray() to call all 
#idfs = [tfidf(corp) for corp in tokens]
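
The cell above relies on the sparse matrix built by `TfidfVectorizer` and does not construct an explicit inverted file. For reference, a minimal sketch of an inverted index over tokenized documents is given below; the function names and the toy documents are assumptions made for illustration.

```python
from collections import defaultdict

def build_inverted_index(tokenized_docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(tokenized_docs):
        for term in tokens:
            index[term].add(doc_id)
    return index

def candidates(index, query_terms):
    """Return the ids of documents that contain at least one query term."""
    hits = set()
    for term in query_terms:
        hits |= index.get(term, set())
    return hits

# Toy corpus of stemmed tokens (hypothetical).
docs = [['dragon', 'hero', 'run'], ['hero', 'puzzl'], ['sleep', 'relax']]
index = build_inverted_index(docs)
print(candidates(index, ['dragon', 'puzzl']))
```

With such an index, only the candidate documents need to be scored against a query, instead of computing the similarity to every description.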

### Query Process

E.g. "Dragon, Control, hero, running"

E.g. "The hero controls the dragon to run."



In [239]:
def n_max(arr, n):
    """Return the n largest entries of arr as (value, column index) pairs, best first."""
    flat_indices = arr.ravel().argsort()[-n:][::-1]
    indices = (np.unravel_index(i, arr.shape) for i in flat_indices)
    return [(arr[i], i[1]) for i in indices]

In [240]:
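def query(qstring, k):
    """Return the k descriptions most similar to qstring as (score, index) pairs."""
    # The query is vectorized with the same vocabulary and idf weights as the corpus.
    q = transvector.transform([qstring])
    res = cosine_similarity(q, vocab_matrix)
    return n_max(res, k)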
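
The ranking step can be illustrated without `sklearn`. The sketch below computes cosine similarity between sparse `{term: weight}` vectors and keeps the `k` best documents; the helper names and the toy vectors are assumptions for this example only.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k(query_vec, doc_vecs, k):
    """Return the k highest (score, doc index) pairs, best first."""
    scored = ((cosine(query_vec, d), i) for i, d in enumerate(doc_vecs))
    return heapq.nlargest(k, scored)

# Toy document vectors (hypothetical weights).
doc_vecs = [{'dragon': 1.0}, {'hero': 1.0}, {'dragon': 1.0, 'hero': 1.0}]
best = top_k({'dragon': 1.0}, doc_vecs, 2)
print(best)
```

A document that shares no term with the query gets a score of exactly 0, which is why such documents can still show up at the bottom of a top-k list.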

In [253]:
qres = query("ponyville", 5)

In [254]:
for row in qres:
    print("ID:", row[1], " - ", apps[row[1]][0], " - ", row[0], ". Desc:", apps[row[1]][1][0:50], "...")

ID: 1165  -  My Little Pony Rainbow Runners  -  0.12055867793548761 . Desc: Run, jump, fly and restore the colors of the world ...
ID: 1178  -  Deep Sleep and Relax Hypnosis  -  0.0 . Desc: Do you have trouble sleeping or getting into a rel ...
ID: 394  -  Kids Tap and Color (Lite)  -  0.0 . Desc: Coloring Book Tap &amp; Color is an interactive co ...
ID: 388  -  NFL  -  0.0 . Desc: The official app of the NFL is the best, pure foot ...
ID: 389  -  Chitose.  A crypto currency prices viewer on wear  -  0.0 . Desc: Chitose is a crypto currency prices viewer on your ...
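
In the output above only the first hit has a non-zero similarity; the other four rows are documents that do not contain the query term at all and merely fill up the top 5. One way to handle this, sketched below, is to drop results at or below a score threshold. `drop_non_matches` is a hypothetical helper, and the scores are copied from the output above.

```python
def drop_non_matches(results, min_score=0.0):
    """Keep only (score, doc id) pairs whose score is strictly above min_score."""
    return [(score, doc_id) for score, doc_id in results if score > min_score]

# (score, doc id) pairs as printed for the query "ponyville".
ranked = [(0.12055867793548761, 1165), (0.0, 1178), (0.0, 394), (0.0, 388), (0.0, 389)]
matches = drop_non_matches(ranked)
print(matches)
```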


In [246]:
apps[899]

('Soy Luna - Hits Music Lyrics',
 'Sing by reading the best music lyrics Soy Luna - Hits Music Lyrics.<br>You can hear Soy Luna music - Hits Music Lyrics anytime and anywhere you need.<br><br>Simply download you can get all of Soy Luna&#39;s music on your mobile.<br><br>soy luna open music en la mansion<br>soy luna open music ambar y matteo<br>soy luna open music a rodar mi vida<br>soy luna open amber music delfi y jazmin<br>soy luna open music ambre<br>soy luna open music auslosung<br>soy luna open music alzo mi bandera<br>soy luna open music amber canta mirame a mi<br>soy luna open music auf deutsch<br>soy luna open music i&#39;d be crazy<br>soy luna open music roller band<br>soy luna open music cuando bailo<br>Bailes de soy luna open music<br>soy luna open music ad be crazy<br>soy luna open music con letra<br>soy luna open music chica vs chico completo letra<br>soy luna open music chica vs chica<br>soy luna open music chica vs chico completo capitulo<br>soy luna open music corazo<br