# Lab 1: Information Retrieval

__Students:__ 

### Crawling



a) Get the webpage content by using functions in the __[urllib module](https://docs.python.org/3/library/urllib.html#module-urllib)__.

Other libraries, e.g. Scrapy or Beautiful Soup, are also fine for crawling.
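As a minimal sketch of step a) using only the standard library, the hypothetical helper `fetch_html` below wraps `urllib.request.urlopen`. The helper name, the `User-Agent` value, and the 10-second timeout are illustrative assumptions; the cells that follow use `requests` instead.

```python
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Download a page and return its body as text (hypothetical helper)."""
    # Some sites may reject the default urllib user agent, so send an explicit one.
    req = Request(url, headers={"User-Agent": "Mozilla/5.0 (lab crawler)"})
    with urlopen(req, timeout=timeout) as resp:
        # Use the charset declared in the response headers, falling back to UTF-8.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Example call (needs network access):
# html = fetch_html("https://play.google.com/store/apps?hl=en")
```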

In [141]:
import requests
import bs4 as bs
import re
import pandas as pd

b) Get the link behind each "see more" button, then loop over those links to collect the app URLs.

In [142]:
base_url = 'https://play.google.com'
category_urls = [
    'https://play.google.com/store/apps?hl=en',
    'https://play.google.com/store/apps/top?hl=en',
    'https://play.google.com/store/apps/new?hl=en',
    'https://play.google.com/store/apps/category/FAMILY?hl=en',
    'https://play.google.com/store/apps/category/GAME?hl=en',
]

In [143]:
links = list()
for url in category_urls:
    resp = requests.get(url)
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    links += [link['href'] for link in soup.find_all("a", class_="see-more", href=True)]

In [144]:
def get_game_urls(sub_url):
    resp = requests.get(base_url + sub_url)
    appreg = r'href=\"(/store/apps/details.*?)\"'
    appre = re.compile(appreg)
    app_url_list = re.findall(appre, resp.text)
    return list(set(app_url_list))

In [145]:
app_suburls = list()
for link in links:
    app_suburls += get_game_urls(link)
app_suburls = list(set(app_suburls))

In [146]:
print('we have collected {} unique app urls'.format(len(app_suburls)))

we have collected 1054 unique app urls
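The link-collection step can be tried offline on a small HTML snippet. The sketch below applies the same regular expression as `get_game_urls`, unescapes HTML entities such as `&amp;`, and resolves relative paths with `urllib.parse.urljoin` rather than string concatenation. The helper name `extract_app_urls` and the sample HTML are made up for illustration.

```python
import re
from html import unescape
from urllib.parse import urljoin

BASE_URL = "https://play.google.com"
APP_LINK_RE = re.compile(r'href="(/store/apps/details.*?)"')


def extract_app_urls(html, base_url=BASE_URL):
    """Return the sorted, de-duplicated absolute app URLs found in html."""
    # A set removes duplicates; unescape turns &amp; back into &.
    relative = {unescape(path) for path in APP_LINK_RE.findall(html)}
    return sorted(urljoin(base_url, path) for path in relative)


# Made-up sample page: two distinct app links, one duplicate, one non-app link.
sample_html = '''
<a href="/store/apps/details?id=com.example.one&amp;hl=en">One</a>
<a href="/store/apps/details?id=com.example.two">Two</a>
<a href="/store/apps/details?id=com.example.one&amp;hl=en">One again</a>
<a href="/store/apps/category/GAME">Not an app page</a>
'''
print(extract_app_urls(sample_html))
```

Sorting the result is optional; it only makes the output order reproducible between runs, which a plain `list(set(...))` does not guarantee.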


c) Access each app's webpage to get its description, then store the descriptions in files.

In [147]:
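def get_app_name_and_desc(app_url):
    """Return [title, description] for an app page, or [] if it cannot be parsed."""
    try:
        resp = requests.get(base_url + app_url)
        soup = bs.BeautifulSoup(resp.text, 'lxml')
        title = soup.find('h1', {'itemprop' : 'name'}).text
        desc = soup.find('div', {'jsname' : 'sngebd'}).text
        return [title, desc]
    except (requests.RequestException, AttributeError):
        # Network failure, or an expected element is missing (find() returned None)
        return []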

In [148]:
data = list()
for counter, url in enumerate(app_suburls, start=1):
    title_and_desc = get_app_name_and_desc(url)
    if counter % 100 == 0:
        print(f'processed {counter} apps')
    if title_and_desc:
        data.append(title_and_desc)

In [149]:
df = pd.DataFrame(columns=['title', 'desc'], data=data)

In [150]:
df.head()

|   | title | desc |
|---|---|---|
| 0 | Solitaire | Solitaire by Me2Zen Solitaire Games is the Wor... |
| 1 | Wonka's World of Candy – Match 3 | The keys to the factory are yours! Step into y... |
| 2 | LINE: Free Calls & Messages | LINE reshapes communication around the globe, ... |
| 3 | DieMaus | Wir freuen uns, dir hier die offizielle App "D... |
| 4 | LEGO® City game – new Arctic Explorers! | LEGO CITY ARCTIC EXPLORERS HAS ARRIVED!Explore... |
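Step c) also asks for the descriptions to be stored in files, which the cells above do not yet do. One possible approach, sketched here with the standard library only, writes one UTF-8 text file per app. The helper name `save_descriptions`, the `descriptions` directory, and the file-naming scheme are assumptions, not part of the lab template.

```python
import re
from pathlib import Path


def save_descriptions(rows, out_dir="descriptions"):
    """Write each (title, desc) pair to its own text file and return the paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for idx, (title, desc) in enumerate(rows):
        # Replace characters that are unsafe in file names; the index prefix
        # keeps names unique even when two titles collapse to the same slug.
        slug = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_") or "untitled"
        file_path = out_path / f"{idx:04d}_{slug}.txt"
        file_path.write_text(desc, encoding="utf-8")
        written.append(file_path)
    return written
```

With the DataFrame built above, a call could look like `save_descriptions(df[['title', 'desc']].itertuples(index=False))`. If a single file is enough, `df.to_csv(...)` is a simpler alternative.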


### Construct an Inverted File Index (Vector Model)



d) Preprocess text using NLP techniques from __[nltk module](http://www.nltk.org/py-modindex.html)__ or spaCy.

Use `nltk.download(ID)` to fetch a corpus if it has not been downloaded before. See __[nltk corpora](http://www.nltk.org/nltk_data/)__.

In [169]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /home/max/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/max/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [170]:
from nltk.corpus import stopwords
from nltk import word_tokenize

In [179]:
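def clean_desc(desc):
    # Build the stop-word set once instead of reloading the list for every token
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(desc.lower())
    # Keep alphanumeric tokens that are not stop words
    return [w for w in words if w not in stop_words and w.isalnum()]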

In [180]:
clean_desc(df['desc'][0])

['solitaire',
 'me2zen',
 'solitaire',
 'games',
 'world',
 '1',
 'solitaire',
 'card',
 'game',
 'android',
 'google',
 'play',
 'completely',
 'free',
 'play',
 'solitaire',
 'also',
 'known',
 'klondike',
 'solitaire',
 'patience',
 'solitaire',
 'me2zen',
 'popular',
 'solitaire',
 'card',
 'game',
 'like',
 'solitaire',
 'classic',
 'spider',
 'solitaire',
 'freecell',
 'solitaire',
 'pyramid',
 'solitaire',
 'free',
 'solitaire',
 'patience',
 'card',
 'games',
 'going',
 'love',
 'game',
 'solitaire',
 'me2zen',
 'solitaire',
 'games',
 'features',
 'classic',
 'gameplay',
 'easy',
 'single',
 'tap',
 'place',
 'card',
 'drag',
 'daily',
 'fun',
 'amazing',
 'standard',
 'klondike',
 'winning',
 'timer',
 'vegas',
 'solitaire',
 'draw',
 '1',
 'solitaire',
 'draw',
 '3',
 'choose',
 'card',
 'right',
 'left',
 'hand',
 'deal',
 'high',
 'track',
 'unlimited',
 'free',
 'unlimited',
 'free',
 'small',
 'auto',
 'complete',
 'option',
 'solve',
 'phone',
 'tablet',
 'portrait',
 ...

e) Compute TF-IDF

e.g. using functions from the __[scikit-learn module](http://scikit-learn.org/stable/modules/classes.html)__. `TfidfVectorizer` converts a collection of raw documents to a matrix of TF-IDF features.

#### You can also build the TF-IDF matrix with another library or your own algorithm

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
transvector = TfidfVectorizer()
corpus = [
     'This is the first document.',
     'This is the second second document.',
     'And this is the third one.',
     'Is this the first document?',]
tfidf1 = transvector.fit_transform(corpus)
print(tfidf1.toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(transvector.get_feature_names_out().tolist())


[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.27925389 0.         0.22830836 0.         0.87501037
  0.22830836 0.         0.22830836]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
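The TF-IDF matrix above is a document-by-term view, while an inverted file is the term-to-postings view of the same data. The following is a minimal standard-library sketch, assuming the documents are already tokenized (for instance by `clean_desc`). The function name `build_inverted_index` is hypothetical, and the smoothed IDF formula `ln((1 + N) / (1 + df)) + 1` is chosen to mirror scikit-learn's default.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(tokenized_docs):
    """Map each term to {doc_id: term_frequency} and compute a smoothed IDF."""
    postings = defaultdict(dict)
    for doc_id, tokens in enumerate(tokenized_docs):
        for term, tf in Counter(tokens).items():
            postings[term][doc_id] = tf
    n_docs = len(tokenized_docs)
    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1, where df = number of docs with the term
    idf = {term: math.log((1 + n_docs) / (1 + len(docs))) + 1
           for term, docs in postings.items()}
    return dict(postings), idf


# Toy tokenized corpus, made up for illustration
docs = [
    ["first", "document"],
    ["second", "second", "document"],
    ["third", "one"],
]
index, idf = build_inverted_index(docs)
print(index["second"])    # {1: 2}
print(index["document"])  # {0: 1, 1: 1}
```

At query time only the postings of the query terms need to be read, so documents that share no term with the query are never scored.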


### Query Process

e.g. "Dragon, Control, hero, running"

e.g. "The hero controls the dragon to run."
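A query can be handled like one more document: transform it with the vectorizer that was fitted on the corpus, then rank the documents by cosine similarity. The sketch below assumes scikit-learn; `rank_documents`, the three toy documents, and `top_k` are hypothetical. No stemming is applied, so "controls" and "control", or "running" and "run", remain different terms unless a stemmer or lemmatizer is added to the preprocessing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_documents(query, documents, top_k=3):
    """Return up to top_k (doc_index, score) pairs, best match first."""
    vectorizer = TfidfVectorizer(stop_words="english")
    doc_matrix = vectorizer.fit_transform(documents)
    # Reuse the fitted vocabulary and IDF weights for the query
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    order = scores.argsort()[::-1][:top_k]
    # Drop documents that share no term with the query
    return [(int(i), float(scores[i])) for i in order if scores[i] > 0]


# Toy corpus, made up for illustration
toy_docs = [
    "The hero controls the dragon",
    "Solitaire card game",
    "Endless running game with a hero",
]
results = rank_documents("dragon hero", toy_docs)
print([doc_id for doc_id, _ in results])  # [0, 2]
```

For the app data, the same function could be called with `df['desc'].tolist()` as `documents`, ideally after applying the same cleaning to both the descriptions and the query.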

