# Lab 1: Information Retrieval

**Students: Fanny Karelius (fanka300), Milda Poceviciute (milpo192)**

## Step 1 - Crawling

In this lab we used the urllib package to fetch HTML pages from the Google Play app store.

In [2]:
from bs4 import BeautifulSoup
import urllib.request
import re

html_page = urllib.request.urlopen("https://play.google.com/store/apps/top")
soup = BeautifulSoup(html_page,"lxml")


From the HTML of the main page we extracted the hrefs of the app categories. We use them to crawl data on about 1000 apps from different categories. In total we found 61 unique categories.

In [3]:
hrefs = list()
for link in soup.findAll('a'):
    hrefs.append(link.get('href'))

In [4]:

final_hrefs = list()
for href in hrefs:
    # skip anchors without an href (link.get returns None for those)
    if href and href.startswith('/store/apps/category'):
        final_hrefs.append(href)

In [5]:
final_hrefs = list(set(final_hrefs))
print(len(final_hrefs))

61


In [6]:
import urllib.request
import re
# regex for app detail links; compiled once, outside the loop
appre = re.compile(r'href=\"(/store/apps/details.*?)\"')
final_list = list()
# note: only the first 55 of the 61 category links are crawled
for item in final_hrefs[0:55]:
    url = 'https://play.google.com' + item
    x = urllib.request.urlopen(url).read().decode('utf-8')
    # deduplicate the app links found within this category page
    app_url_list = list(set(re.findall(appre, x)))
    final_list.append(app_url_list)

After fetching the HTML of each category page, we end up with 1179 app links. Duplicates are removed within each category page, not across categories.

In [134]:
from functools import reduce
apps = reduce(lambda x,y: x+ y, final_list, [])
len(apps)


1179
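
Because the links are deduplicated per category only, an app listed in two categories could still appear twice after the lists are concatenated. A minimal sketch of flattening with an order-preserving dedup follows; the hrefs and the helper name `flatten_unique` are made up for illustration and are not part of the lab code.

```python
from itertools import chain

# Hypothetical per-category results; the second app appears in both categories.
per_category = [
    ['/store/apps/details?id=com.example.one',
     '/store/apps/details?id=com.example.two'],
    ['/store/apps/details?id=com.example.two',
     '/store/apps/details?id=com.example.three'],
]

def flatten_unique(lists):
    """Concatenate the lists, keeping only the first occurrence of each item."""
    # dict.fromkeys preserves insertion order (Python 3.7+) and drops repeats.
    return list(dict.fromkeys(chain.from_iterable(lists)))

unique_apps = flatten_unique(per_category)
print(len(unique_apps))
```

Keeping the first-seen order, rather than using a plain set, makes the document indices reproducible between runs.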

Now we extract the details (descriptions) of each of the apps.

In [119]:
desc_list = list()
# regex for the description block of an app page; compiled once
descre = re.compile(r'itemprop=\"description.*?\">.*?<div jsname=\".*?\">.*?</div>')
for app in apps:
    url = 'https://play.google.com' + app + '&hl=en'
    x = urllib.request.urlopen(url).read().decode('utf-8')
    # HTML fragments matching the description block (duplicates removed)
    desc_matches = list(set(re.findall(descre, x)))
    desc_list.append(desc_matches)

In [121]:
desc_final = list()
for item in desc_list:
    # join the matched HTML fragments and strip the tags
    str1 = ''.join(item)
    soup = BeautifulSoup(str1,"lxml")
    desc_final.append(soup.div.get_text())

At the end of this section we have the raw descriptions of the apps that we found on the Google Play website.
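
The tag-stripping step can also be illustrated with the standard library alone. The sketch below is not the method used in this lab (which relies on BeautifulSoup); it uses `html.parser` on a made-up HTML fragment, and the names `TextExtractor` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment and ignore the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def html_to_text(fragment):
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    # Join the text nodes and collapse runs of whitespace into single spaces.
    return ' '.join(' '.join(parser.parts).split())

# Made-up fragment shaped like a description block.
fragment = ('<div itemprop="description"><div jsname="x">'
            'Find <b>new</b> music.<br>Listen offline.</div></div>')
print(html_to_text(fragment))
```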

## Step 2 - Vector Model

In [123]:
import nltk
#nltk.download()
#nltk.download('stopwords')

In this part we split the descriptions into lowercase words, remove digits, punctuation and stop words, and finally use stemming to keep just the stems of the words.

In [124]:
# split into words
tokenizer = nltk.tokenize.RegexpTokenizer(r'\b[^\d\W]+\b')
word_list = [tokenizer.tokenize(w) for w in desc_final]
# make everything lowercase
word_list = [[w.lower() for w in line] for line in word_list]
len(word_list)

1179
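
To see what the token pattern keeps and drops, here is a small sketch using only the `re` module, assuming the tokenizer behaves like `re.findall` with the same pattern. The sample sentence and the tiny stop list are made up for illustration; NLTK's English stop-word list is much longer.

```python
import re

# Same pattern as the tokenizer above: runs of word characters that are not digits.
TOKEN_PATTERN = re.compile(r'\b[^\d\W]+\b')

# Tiny illustrative stop list (an assumption, not NLTK's list).
STOP_WORDS = {'you', 're', 't', 'the', 'to', 'a'}

def tokenize(text):
    """Split text into lowercase tokens, dropping digits and punctuation."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    return [t for t in tokens if t not in stop_words]

tokens = tokenize("Don't worry: get 1 free month, you're set!")
print(tokens)
# ['don', 't', 'worry', 'get', 'free', 'month', 'you', 're', 'set']
print(remove_stop_words(tokens))
# ['don', 'worry', 'get', 'free', 'month', 'set']
```

The apostrophe splits contractions into two tokens, which is why fragments such as "don t" and "you re" show up in the processed descriptions.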

In [125]:
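# remove stop words
from nltk.corpus import stopwords
# build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
filtered_words = [[word for word in word_list_item if word not in stop_words] for word_list_item in word_list]
len(filtered_words)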

1179

In [126]:
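# Stem the words
# note: this stems word_list, not filtered_words, so the stop-word filtering
# above does not reach the documents (the printed documents further down still
# contain stop words); pass filtered_words here to apply it
from nltk.stem import PorterStemmer
ps = PorterStemmer()
stemmed_words = [[ps.stem(word) for word in word_list_item] for word_list_item in word_list]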

In [127]:
# join the stems of each document into one string per document
documents = [' '.join(x) for x in stemmed_words]


We use the sklearn package to calculate tf-idf weights for each word. The weight of a word in a document grows with how often the word occurs in that document and shrinks with the number of documents that contain it. From the dimensions of the resulting matrix, we can see that there are 1179 documents, and in total they contain 13519 distinct terms.

In [128]:
from sklearn.feature_extraction.text import TfidfVectorizer
transvector = TfidfVectorizer()
# fit_transform learns the vocabulary and idf weights and returns the tf-idf matrix
tfidf1 = transvector.fit_transform(documents)

print(tfidf1.shape)


(1179, 13519)
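
To make the weighting concrete, here is a small pure-Python sketch of tf-idf with raw term counts, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2-normalised rows, which to our understanding matches the default settings of TfidfVectorizer. The toy documents and the function name `tfidf_rows` are made up, and splitting on whitespace is a simplification of the vectorizer's own tokenization.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Return (vocabulary, rows) with L2-normalised tf-idf weights per document."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[term] * idf[term] for term in vocabulary]
        # Scale the row to unit length; an empty document keeps all zeros.
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocabulary, rows

docs = ["music app stream music", "dragon game", "music game"]
vocabulary, rows = tfidf_rows(docs)
print(vocabulary)
print([round(w, 3) for w in rows[0]])
```

In the first toy document "music" occurs twice, so it gets the largest weight even though its idf is lower than that of "app" or "stream".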


## Step 3 - Query Process

Here we wrote a function that applies the same pre-processing to a query (a sentence) as we applied to the apps' descriptions.

In [129]:
# search keywords
def prep_query(words):
    # split into words
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\b[^\d\W]+\b')
    words2 = tokenizer.tokenize(words)
    # make everything lowercase
    words2 = [w.lower() for w in words2]
    # remove stop words
    stop_words = set(stopwords.words('english'))
    words2 = [word for word in words2 if word not in stop_words]
    # stem the remaining words
    ps = PorterStemmer()
    query_words = [ps.stem(word) for word in words2]
    return query_words

keywords1 = "I like to listening to the music"
query1 = prep_query(keywords1)
print(query1)

keywords2 = "The hero controls the dragon to run."
query2 = prep_query(keywords2)
print(query2)

['like', 'listen', 'music']
['hero', 'control', 'dragon', 'run']


Below we compute the tf-idf weights of the query by calling transform on the fitted TfidfVectorizer. The vectorizer learned its vocabulary and idf weights from the app descriptions when it was fitted above, so the query is mapped onto that same vocabulary. Because the query is passed as a list of stems, each stem becomes its own row in the query matrix. We then compute the cosine similarity between every query row and every document's tf-idf vector, and sum the similarities over the query rows. The higher the summed similarity, the better the document is supposed to match the query, so we order the documents by this score in descending order.

In [130]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

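def query_search(query):
    # query is a list of stems; each stem becomes one row of the query matrix
    values = transvector.transform(query)
    # rows: query stems, columns: documents (sparse input is accepted directly)
    similarity = cosine_similarity(values, tfidf1)
    # total similarity of each document over all query stems
    sumsim = similarity.sum(axis=0)
    # document indices, best match first
    document_ranking = np.argsort(-sumsim)
    return document_ranking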
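
The ranking logic can be sketched without sklearn: one vector per query stem, cosine similarity against each document, and a sum over the query stems. The vectors, the three-word vocabulary, and the names `cosine` and `rank_documents` are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_documents(query_vectors, doc_vectors):
    """Sum each document's similarity over all query vectors, then sort descending."""
    scores = [sum(cosine(q, d) for q in query_vectors) for d in doc_vectors]
    ranking = sorted(range(len(doc_vectors)), key=lambda i: -scores[i])
    return ranking, scores

# Toy vectors over a hypothetical vocabulary ['dragon', 'listen', 'music'].
doc_vectors = [
    [0.0, 0.6, 0.8],  # document 0: "listen" and "music"
    [1.0, 0.0, 0.0],  # document 1: "dragon"
    [0.0, 0.0, 1.0],  # document 2: "music"
]
# One vector per query stem, mirroring how a list of stems is transformed.
query_vectors = [
    [0.0, 1.0, 0.0],  # 'listen'
    [0.0, 0.0, 1.0],  # 'music'
]
ranking, scores = rank_documents(query_vectors, doc_vectors)
print(ranking)  # [0, 2, 1]
```

Document 0 ranks first because it matches both query stems, while document 2 matches only "music" and document 1 matches neither.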

In [131]:
docs_query1 = query_search(query1)
docs_query2 = query_search(query2)

In [132]:
print(docs_query1)
print(docs_query2)

[ 492  561 1145 ...,  453  440 1178]
[ 385 1080  528 ...,  441  424 1178]


Our search algorithm chose the app "youtube.music" as the best match for the query "I like to listening to the music". It is a good fit, as the description repeats the stems "music" and "listen" several times.

In [137]:
# query 1 top match
print(apps[docs_query1[0]])
print(documents[docs_query1[0]])

/store/apps/details?id=com.google.android.apps.youtube.music
youtub music is a new music app that allow you to easili find what you re look for and discov new music get playlist and recommend serv to you base on your context tast and what s trend around you a new music stream servic from youtub thi is a complet reimagin music servic with offici releas from your favorit artist find the music you want easili find the album singl live perform cover and remix you re look for don t know a song s name just search the lyric or describ it discov new music get music recommend base on tast locat and time of day use the hotlist to keep up with what s trend uninterrupt listen with music premium listen ad free don t worri about your music stop when you lock your screen or use other app download your favorit or let us do it for you by enabl offlin mixtap get one free month of music premium to listen ad free offlin and with your screen lock then pay just a month exist youtub red or googl play music m

The search algorithm chose the app "mattel.eahdragons" as the best match for the query "The hero controls the dragon to run.". The description fits the query mainly through the stem "dragon", which it repeats many times; the other query stems ("hero", "control", "run") do not occur in the printed text.

In [136]:
# query 2 top match
print(apps[docs_query2[0]])
print(documents[docs_query2[0]])

/store/apps/details?id=com.mattel.eahdragons
babi dragon have arriv at ever after high just in time for dragon game hatch play feed style and train your babi dragon for the game fill her happi meter and your dragon will bring you a fun new surpris visit your dragon companion everyday to collect enough stamp to unlock a free friend brushfir charm featur style decor style your dragon with cute accessori pattern and magic aura decor your dragon s camp with fun anim decor and furnitur play fun game more babi dragon need lot of love and attent feed and play with your babi dragon to keep her healthi and happi don t forget to play with their cute critter friend teach your babi dragon how to fli dure flight train and collect gem along the way but be care to avoid those wick cloud your dragon is truli magic and it s time to put her fire breath skill to the test with magic train target practic develop your dragon s memori and match similar item through memori train reach level and go river raft 