# Lab 1: Information Retrieval

### Step 1: Retrieve at least 1000 app titles and descriptions from Google Play

First we get the links for the apps that we want to collect information about.

In [1]:
import requests
import bs4 as bs
import re
import pandas as pd

In [2]:
base_url = 'https://play.google.com'
category_urls = [
    'https://play.google.com/store/apps?hl=en',
    'https://play.google.com/store/apps/top?hl=en',
    'https://play.google.com/store/apps/new?hl=en',
    'https://play.google.com/store/apps/category/FAMILY?hl=en',
    'https://play.google.com/store/apps/category/GAME?hl=en',
]

In [3]:
links = list()
for url in category_urls:
    resp = requests.get(url)
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    links += [link['href'] for link in soup.find_all("a", class_="see-more", href=True)]

In [4]:
def get_game_urls(sub_url):
    resp = requests.get(base_url + sub_url)
    appreg = r'href=\"(/store/apps/details.*?)\"'
    appre = re.compile(appreg)
    app_url_list = re.findall(appre, resp.text)
    return list(set(app_url_list))
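To see what the regular expression in `get_game_urls` picks up, here is a minimal offline sketch. The HTML fragment and the `com.example.*` app ids are made up for illustration, and the pattern is written without the redundant backslashes but should match the same strings.

```python
import re

# Hypothetical HTML fragment standing in for a Google Play listing page.
sample_html = '''
<a href="/store/apps/details?id=com.example.alpha">Alpha</a>
<a href="/store/apps/details?id=com.example.beta">Beta</a>
<a href="/store/apps/details?id=com.example.alpha">Alpha again</a>
<a href="/store/apps/category/GAME">Games</a>
'''

# Lazily capture everything from /store/apps/details up to the closing quote.
app_pattern = re.compile(r'href="(/store/apps/details.*?)"')

# set() drops the duplicate link; sorted() gives a stable order for inspection.
app_urls = sorted(set(app_pattern.findall(sample_html)))
print(app_urls)
```

The category link is ignored because it does not start with `/store/apps/details`, and the repeated app link is collapsed by the `set`.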

In [5]:
app_suburls = list()
for link in links:
    app_suburls += get_game_urls(link)
app_suburls = list(set(app_suburls))

In [6]:
print('we have collected {} unique app urls'.format(len(app_suburls)))

we have collected 958 unique app urls


Now we can gather the app title and description by following the collected links.

In [7]:
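def get_app_name_and_desc(app_url):
    """Return [title, description] for an app page, or [] if it cannot be fetched or parsed."""
    try:
        resp = requests.get(base_url + app_url)
        soup = bs.BeautifulSoup(resp.text, 'lxml')
        title = soup.find('h1', {'itemprop' : 'name'}).text
        desc = soup.find('div', {'jsname' : 'sngebd'}).text
        return [title, desc]
    except (requests.RequestException, AttributeError):
        # AttributeError: soup.find returned None because the element is missing
        return []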

In [8]:
data = list()
counter = 0
for url in app_suburls:
    counter += 1
    title_and_desc = get_app_name_and_desc(url)
    if counter % 100 == 0:
        print(f'processed {counter} apps')
    if title_and_desc:
        data.append(title_and_desc)

processed 100 apps
processed 200 apps
processed 300 apps
processed 400 apps
processed 500 apps
processed 600 apps
processed 700 apps
processed 800 apps
processed 900 apps


The collected app titles and descriptions can be put into a pandas dataframe to simplify later indexing into the data.

In [9]:
df = pd.DataFrame(columns=['title', 'desc'], data=data)

### Step 2: Clean our collected app descriptions

We lowercase, tokenize, and stem all the app descriptions, dropping stopwords and non-alphabetic tokens along the way.

In [10]:
import nltk
# download the required parts of nltk needed
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /home/max/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/max/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [11]:
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import PorterStemmer

In [12]:
ps = PorterStemmer()

In [13]:
# build the stopword set once; set membership tests are fast
stop_words = set(stopwords.words('english'))

def clean(desc):
    """Lowercase, tokenize, drop stopwords and non-alphabetic tokens, then stem."""
    words = word_tokenize(desc.lower())
    return ' '.join(ps.stem(w) for w in words if w.isalpha() and w not in stop_words)

In [14]:
# apply our cleaning function to every description
df['desc'] = df['desc'].apply(clean)

Some investigation showed that a few titles are duplicated, so we remove those rows.

In [15]:
df = df.drop_duplicates(subset='title')

### Step 3: Create tfidf model

The inverse document frequency (idf) is calculated as:

$$idf(t, D, N) = \log{\frac{N}{1 + D}}$$

where N is the total number of documents in the corpus and D is the number of documents that contain the term t. The 1 in the denominator avoids division by zero when the term occurs in no document.

The term frequency is the number of occurrences of the term in the document divided by the count of the most frequent term in that document. This normalization prevents a bias towards longer documents:

$$tf_d(t, d) = \frac{f(t, d)}{\max_{t^*} f(t^*, d)}$$

where f(t, d) is the raw count of term t in document d.

The weight vectors, tfidf, are then calculated as:

$$tfidf(t, d, D, N) = tf_d(t, d) * idf(t, D, N)$$
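As a rough illustration, the weights can be computed by hand on a toy corpus. This is a sketch of the definitions in this section, not of scikit-learn's implementation: with `smooth_idf=True`, `TfidfVectorizer` uses the idf $\ln\frac{1 + N}{1 + D} + 1$, takes raw counts as term frequencies, and L2-normalizes each row by default, so its numbers will differ. The corpus and the helper names (`idf`, `tf`, `tfidf_weights`) are made up for the example.

```python
import math
from collections import Counter

# Toy corpus of already-cleaned descriptions (hypothetical data).
docs = [
    "dragon game dragon hero",
    "puzzle game word search",
    "mahjong solitair game",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """log(N / (1 + D)), with D the number of documents containing the term."""
    return math.log(n_docs / (1 + doc_freq[term]))

def tf(term, tokens):
    """Raw count divided by the count of the most frequent term in the document."""
    counts = Counter(tokens)
    return counts[term] / max(counts.values())

def tfidf_weights(tokens):
    """Map every term of one document to its tf * idf weight."""
    return {term: tf(term, tokens) * idf(term) for term in set(tokens)}

weights = [tfidf_weights(tokens) for tokens in tokenized]
for doc, w in zip(docs, weights):
    print(doc, {t: round(v, 3) for t, v in sorted(w.items())})
```

Note that with this idf a term occurring in every document ("game" here) ends up with a negative weight, because N / (1 + D) drops below one.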

By calculating the cosine similarity of a query vector to the tfidf weight vectors, we can find the most similar documents to our query.

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(
    use_idf=True, 
    smooth_idf=True,
    analyzer='word'
)
corpus = df['desc'].values
tfidf = vectorizer.fit_transform(corpus)

In [17]:
print(tfidf.shape)

(938, 11245)


Let's calculate the cosine similarity of the first document with all the others:

In [18]:
from sklearn.metrics.pairwise import linear_kernel

In [19]:
cosine_similarities = linear_kernel(tfidf[0], tfidf).flatten()

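Sorting the similarities in ascending order puts the indices of the most similar documents at the end. The last index should be 0, since the similarity of the first document with itself is one. `linear_kernel` computes plain dot products, which equal cosine similarities here because `TfidfVectorizer` L2-normalizes each row by default.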

In [20]:
idxs = cosine_similarities.argsort()[-5:]
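The ranking step can be sketched in plain Python to make the ordering explicit. The weight vectors below are hypothetical, and `cosine_similarity` is a hand-written helper for illustration, not the scikit-learn function.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical weight vectors, one row per document.
doc_vectors = [
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
    [2.0, 1.0, 3.0],
    [1.0, 1.0, 0.0],
]

query = doc_vectors[0]
similarities = [cosine_similarity(query, v) for v in doc_vectors]

# Ascending "argsort": indices ordered from least to most similar.
order = sorted(range(len(similarities)), key=lambda i: similarities[i])

# The best matches sit at the end, so slice from the back and reverse.
top_2 = list(reversed(order[-2:]))
print(top_2)
```

Because the sort is ascending, slicing with `[-k:]` keeps the k best matches and reversing puts the best one first, which is what the next cell does with `reversed(idxs)`.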

In [21]:
for i in reversed(idxs):
    display(df.iloc[i])

title                                     The Game of Life
desc     make choic get paid lose attend colleg accept ...
Name: 0, dtype: object

title                              RISK: Global Domination
desc     full onlin multiplay matchmak everybodi want r...
Name: 46, dtype: object

title                         Infinite Word Search Puzzles
desc     infinit word search take american classic diff...
Name: 631, dtype: object

title                            All-in-One Mahjong 3 FREE
desc     mahjong addict solitair game player challeng e...
Name: 212, dtype: object

title                                 All-in-One Mahjong 3
desc     mahjong addict solitair game player challeng e...
Name: 590, dtype: object

Looks reasonable! The top hit is the first document itself, as expected, and the other hits are also board and puzzle games. Let's pack this into a function.

In [22]:
def find_k_similar_docs(query, vectorizer, tfidf, k):
    """Display the k documents in df most similar to the input query, best match first"""
    query_vector = vectorizer.transform([clean(query)])
    cosine_similarities = linear_kernel(query_vector, tfidf).flatten()
    idxs = cosine_similarities.argsort()[-k:]
    for i in reversed(idxs):
        display(df.iloc[i])

### Query Process

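e.g. "Dragon, Control, hero, running"

e.g. "The hero controls the dragon to run."

After cleaning, both queries reduce to the same stems (dragon, control, hero, run), and the bag-of-words model ignores word order, so they should return the same ranking.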



In [23]:
find_k_similar_docs("Dragon, Control, hero, running", vectorizer, tfidf, 5)

title                       Baby Dragons: Ever After High™
desc     babi dragon arriv ever high time dragon game h...
Name: 528, dtype: object

title                                    School of Dragons
desc     join hiccup toothless astrid rest vike school ...
Name: 181, dtype: object

title                                       Merge Dragons!
desc     discov dragon legend magic quest secret land e...
Name: 929, dtype: object

title                         Little Pets Animal Guardians
desc     explor beauti craft level play littl anim tame...
Name: 757, dtype: object

title                          DRAGON BALL Z DOKKAN BATTLE
desc     connect ki sphere unleash power kamehameha ult...
Name: 216, dtype: object

In [24]:
find_k_similar_docs("The hero controls the dragon to run.", vectorizer, tfidf, 5)

title                       Baby Dragons: Ever After High™
desc     babi dragon arriv ever high time dragon game h...
Name: 528, dtype: object

title                                    School of Dragons
desc     join hiccup toothless astrid rest vike school ...
Name: 181, dtype: object

title                                       Merge Dragons!
desc     discov dragon legend magic quest secret land e...
Name: 929, dtype: object

title                         Little Pets Animal Guardians
desc     explor beauti craft level play littl anim tame...
Name: 757, dtype: object

title                          DRAGON BALL Z DOKKAN BATTLE
desc     connect ki sphere unleash power kamehameha ult...
Name: 216, dtype: object