# Semantic Search

### Part 0 - Data Collection

Query the Wikipedia API and **collect all of the articles** under the following Wikipedia categories:
* [Machine Learning](https://en.wikipedia.org/wiki/Category:Machine_learning)
* [Business Software](https://en.wikipedia.org/wiki/Category:Business_software)

The code should be modular enough to query any valid Wikipedia category, not just these two.

The results of the query will be written to PostgreSQL tables, `page` and `category`. 
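
The helper `fy.jsonify_wiki_category` is not shown in this notebook. As a rough sketch of what such a query could look like, the snippet below calls the MediaWiki `categorymembers` list and follows continuation tokens. The function names, the `USER_AGENT` string, and the use of `urllib` are assumptions made for illustration, not the contents of `library/functions.py`.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"  # English Wikipedia endpoint
USER_AGENT = "semantic-search-notebook/0.1 (example)"  # hypothetical identifier


def build_category_params(category, limit=500, cmcontinue=None):
    """Build query parameters for the MediaWiki `categorymembers` list."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,  # e.g. "Category:Machine_learning"
        "cmlimit": limit,     # 500 is the cap for anonymous clients
        "format": "json",
    }
    if cmcontinue:
        params["cmcontinue"] = cmcontinue  # resume where the last page stopped
    return params


def fetch_category_members(category):
    """Yield every member of `category`, one API page at a time."""
    token = None
    while True:
        query = urlencode(build_category_params(category, cmcontinue=token))
        request = Request(API_URL + "?" + query, headers={"User-Agent": USER_AGENT})
        with urlopen(request) as response:
            payload = json.load(response)
        yield from payload["query"]["categorymembers"]
        token = payload.get("continue", {}).get("cmcontinue")
        if token is None:
            break
```

Because the category name is an ordinary argument, `list(fetch_category_members('Category:Business_software'))` would work the same way as the Machine Learning query.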

In [1]:
!pip install wikipedia



In [2]:
!pip install psycopg2



In [3]:
import wikipedia
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re
import string
import pickle

In [4]:
from os import chdir
chdir('/home/jovyan/')

In [5]:
import library.db_helper as db
import library.functions as fy

In [6]:
fy.jsonify_wiki_category('Category:Machine_learning')

{'batchcomplete': '',
 'limits': {'categorymembers': 500},
 'query': {'categorymembers': [{'ns': 2,
    'pageid': 54972729,
    'title': 'User:CustIntelMngt/sandbox/Customer Intelligence Management'},
   {'ns': 0, 'pageid': 43385931, 'title': 'Data exploration'},
   {'ns': 0,
    'pageid': 49082762,
    'title': 'List of datasets for machine learning research'},
   {'ns': 0, 'pageid': 233488, 'title': 'Machine learning'},
   {'ns': 0, 'pageid': 53587467, 'title': 'Outline of machine learning'},
   {'ns': 0, 'pageid': 3771060, 'title': 'Accuracy paradox'},
   {'ns': 0, 'pageid': 43808044, 'title': 'Action model learning'},
   {'ns': 0,
    'pageid': 28801798,
    'title': 'Active learning (machine learning)'},
   {'ns': 0, 'pageid': 45049676, 'title': 'Adversarial machine learning'},
   {'ns': 0, 'pageid': 52642349, 'title': 'AIVA'},
   {'ns': 0, 'pageid': 30511763, 'title': 'AIXI'},
   {'ns': 0, 'pageid': 50773876, 'title': 'Algorithm Selection'},
   {'ns': 0, 'pageid': 20890511, 'titl

In [7]:
fy.dfize_category_names('Category:Machine_learning').head()

Unnamed: 0,ns,pageid,title,category
0,2,54972729,User:CustIntelMngt/sandbox/Customer Intelligen...,Category:Machine_learning
1,0,43385931,Data exploration,Category:Machine_learning
2,0,49082762,List of datasets for machine learning research,Category:Machine_learning
3,0,233488,Machine learning,Category:Machine_learning
4,0,53587467,Outline of machine learning,Category:Machine_learning


In [8]:
fy.dfize_cat_articles_only('Category:Machine_learning').head()

Unnamed: 0,ns,pageid,title,category
0,2,54972729,User:CustIntelMngt/sandbox/Customer Intelligen...,Category:Machine_learning
1,0,43385931,Data exploration,Category:Machine_learning
2,0,49082762,List of datasets for machine learning research,Category:Machine_learning
3,0,233488,Machine learning,Category:Machine_learning
4,0,53587467,Outline of machine learning,Category:Machine_learning
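
Each category member carries an `ns` (namespace) value: in MediaWiki, 0 is the article namespace, 2 is user pages, and 14 is categories. The first row above has `ns` 2, so it is a user sandbox rather than an article. The following is a minimal sketch of separating articles from subcategories on that field; `split_members` is a hypothetical helper, not part of `fy`, and the `ns` 14 entry in the sample is made up.

```python
ARTICLE_NS = 0    # main/article namespace in MediaWiki
CATEGORY_NS = 14  # category namespace (subcategories)


def split_members(members):
    """Separate article entries from subcategory entries; drop everything else."""
    articles = [m for m in members if m["ns"] == ARTICLE_NS]
    subcategories = [m for m in members if m["ns"] == CATEGORY_NS]
    return articles, subcategories


sample = [
    {"ns": 2, "pageid": 54972729,
     "title": "User:CustIntelMngt/sandbox/Customer Intelligence Management"},
    {"ns": 0, "pageid": 43385931, "title": "Data exploration"},
    {"ns": 14, "pageid": 1, "title": "Category:Example subcategory"},  # made-up entry
]
articles, subcategories = split_members(sample)
```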


### Machine Learning

In [9]:
ml_df = fy.dfize_cat_articles_only('Category:Machine_learning')

In [10]:
ml_df.drop_duplicates(inplace=True)

In [11]:
ml_df['category'] = 'Machine Learning'

In [12]:
ml_df.shape

(200, 4)

In [13]:
ml_article_content = []

for article in ml_df['title'].tolist():
    page = fy.beautify_html_article(article)
    ml_article_content.append(page)

In [14]:
ml_df['text'] = ml_article_content

In [15]:
ml_df.shape

(200, 5)

In [29]:
ml_df.sample(5)

Unnamed: 0,ns,pageid,title,category,text
112,0,5721403,Machine Learning (journal),Machine Learning,Machine Learning DisciplineMachine learningLa...
21,0,40678189,Bias–variance tradeoff,Machine Learning,This article needs additional citations for ve...
199,0,47527969,Word2vec,Machine Learning,Machine learning anddata miningProblemsClassif...
140,0,23864280,Parity learning,Machine Learning,Parity learning is a problem in machine learni...
53,0,1422176,Developmental robotics,Machine Learning,"o Developmental robotics (DevRob), sometimes c..."


#### Generate Machine Learning Subcategory DataFrame (one level deep only)

In [16]:
ml_subcat_df = fy.dfize_subcategory_article('Category:Machine_learning')

In [17]:
ml_subcat_df.drop_duplicates(inplace=True)

In [19]:
ml_subcat_article_content = []

for article in ml_subcat_df['title'].tolist()[:400]:
    page = fy.beautify_html_article(article)
    ml_subcat_article_content.append(page)

In [20]:
for article in ml_subcat_df['title'].tolist()[400:600]:
    page = fy.beautify_html_article(article)
    ml_subcat_article_content.append(page)

In [21]:
for article in ml_subcat_df['title'].tolist()[600:]:
    page = fy.beautify_html_article(article)
    ml_subcat_article_content.append(page)

In [22]:
len(ml_subcat_article_content)

831
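
The slices above split one long download into several cells so that progress is visible. The same idea can be sketched as a single resumable helper. `fetch_in_batches` is a hypothetical name, and `str.upper` stands in for `fy.beautify_html_article` so the example runs without network access.

```python
def fetch_in_batches(titles, fetch, batch_size=200, results=None):
    """Apply `fetch` to each title, reporting progress after every batch.

    `results` may be a partially filled list from an earlier, interrupted run;
    fetching resumes at the first title that has no result yet.
    """
    results = [] if results is None else results
    start = len(results)
    for offset in range(start, len(titles), batch_size):
        for title in titles[offset:offset + batch_size]:
            results.append(fetch(title))
        print(f"fetched {len(results)}/{len(titles)} articles")
    return results


# Stand-in data and fetch function, for illustration only.
content = []
fetch_in_batches(["a", "b", "c"], str.upper, batch_size=2, results=content)
```

Passing your own `results` list means that whatever was fetched before an exception is still available, and calling the function again continues from that point.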

In [23]:
ml_subcat_df['text'] = ml_subcat_article_content

In [24]:
ml_subcat_df['category'] = 'Machine Learning'

In [25]:
ml_subcat_df.sample(5)

Unnamed: 0,ns,pageid,title,category,text
30,0,33544408,GraphLab,Machine Learning,This article relies too much on references to ...
31,0,54238535,Mixture of experts,Machine Learning,Mixture of experts refers to a machine learnin...
0,0,54069,Inductive logic programming,Machine Learning,Programming paradigmsActionAgent-orientedArray...
29,0,318439,Text mining,Machine Learning,"Text mining, also referred to as text data min..."
12,0,705605,Jabberwacky,Machine Learning,This article possibly contains original resear...


In [69]:
ml_df.shape

(200, 5)

In [70]:
ml_subcat_df.shape

(831, 5)

In [74]:
mldfs = [ml_df, ml_subcat_df]

In [75]:
MLdf = pd.concat(mldfs)

In [76]:
MLdf.shape

(1031, 5)

In [80]:
MLdf.drop_duplicates(inplace=True)
MLdf.drop('ns', axis=1, inplace=True)

In [81]:
MLdf.shape

(899, 4)

### Business Software

In [40]:
bs_df = fy.dfize_cat_articles_only('Category:Business_software')

In [41]:
bs_df.drop_duplicates(inplace=True)

In [42]:
bs_df['category'] = 'Business Software'

In [44]:
bs_article_content = []

for article in bs_df['title'].tolist():
    page = fy.beautify_html_article(article)
    bs_article_content.append(page)

In [45]:
bs_df['text'] = bs_article_content

In [46]:
bs_df.drop_duplicates(inplace=True)
bs_df.drop('ns', axis=1, inplace=True)
bs_df.shape

(297, 4)

#### Generate Business Software Subcategory DataFrame (one level deep only)

In [47]:
bs_subcat_df = fy.dfize_subcategory_article('Category:Business_software')

In [48]:
bs_subcat_df.drop_duplicates(inplace=True)

In [49]:
bs_subcat_df.shape

(1461, 4)

In [50]:
# Fetch the article text in batches: the download is slow, and batching confirms each slice works before continuing
bs_subcat_article_content = []

for article in bs_subcat_df['title'].tolist()[:400]:
    page = fy.beautify_html_article(article)
    bs_subcat_article_content.append(page)

In [51]:
for article in bs_subcat_df['title'].tolist()[400:800]:
    page = fy.beautify_html_article(article)
    bs_subcat_article_content.append(page)

In [52]:
for article in bs_subcat_df['title'].tolist()[800:1200]:
    page = fy.beautify_html_article(article)
    bs_subcat_article_content.append(page)

In [53]:
for article in bs_subcat_df['title'].tolist()[1200:]:
    page = fy.beautify_html_article(article)
    bs_subcat_article_content.append(page)

In [55]:
bs_subcat_df['text'] = bs_subcat_article_content

In [60]:
bs_subcat_df.drop_duplicates(inplace=True)
bs_subcat_df.drop('ns', axis=1, inplace=True)

In [57]:
bs_subcat_df['category'] = 'Business Software'

In [61]:
bs_subcat_df.shape

(1323, 4)

In [63]:
BSdf = bs_df.merge(bs_subcat_df, how='outer')
BSdf.shape

(1550, 4)

### Run the `text_cleaner` function on the text for each DataFrame

In [82]:
MLdf['text'] = MLdf['text'].apply(fy.text_cleaner)

In [83]:
BSdf['text'] = BSdf['text'].apply(fy.text_cleaner)
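
The body of `fy.text_cleaner` is not shown here. Judging from the cleaned samples later in this notebook (for example `doesn t exist` and `sportsml g is an xml`), it appears to lowercase the text and replace punctuation and digits with spaces. The sketch below reproduces that behaviour under those assumptions; `clean_text` is a hypothetical stand-in, and the real function may differ, for instance in how it treats repeated spaces or non-ASCII letters.

```python
import re


def clean_text(raw):
    """Lowercase, replace every run of non-letters with one space, trim the ends."""
    lowered = raw.lower()
    letters_only = re.sub(r"[^a-z]+", " ", lowered)  # also drops non-ASCII letters
    return letters_only.strip()


cleaned = clean_text("SportsML-G2 is an XML standard.")
```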

### Pickle the generated DataFrames before joining them

In [84]:
MLdf.to_pickle('./data/MLdf.p')

In [85]:
BSdf.to_pickle('./data/BSdf.p')

#### Join `MLdf` with `BSdf`

In [97]:
total_df = MLdf.merge(BSdf, how='outer')

In [98]:
total_df.shape

(2449, 4)

In [99]:
total_df.to_pickle('./data/total_df.p')

In [100]:
total_df.sample(5)

Unnamed: 0,pageid,title,category,text
1101,638133,Product data management,Business Software,product data management pdm is the business fu...
92,5721283,Journal of Machine Learning Research,Machine Learning,j mach learn res doesn t exist please verify j...
2281,25373946,ActiveVOS,Business Software,the topic of this article may not meet wikiped...
844,4373337,Ross Quinlan,Machine Learning,john ross quinlan is a computer science resear...
1148,2756846,Teamcenter,Business Software,this article relies too much on references to ...


#### Generate Category Numbers

In [101]:
from sklearn.preprocessing import LabelEncoder

In [102]:
le = LabelEncoder()

In [103]:
total_df.sample(5)

Unnamed: 0,pageid,title,category,text
2093,41315870,Jahia,Business Software,jahiadeveloper s jahia solutions group stable...
2007,24719742,Spring Roo,Business Software,spring roospring roo xdeveloper s disid pivo...
2349,7793802,Green Building XML,Business Software,the green building xml schema gbxml is an open...
45,787776,Curse of dimensionality,Machine Learning,the curse of dimensionality refers to various ...
1704,32717006,ClearCheckbook.com,Business Software,this article has multiple issues please help i...


In [104]:
total_df['categoryid'] = le.fit_transform(total_df['category'])

In [105]:
total_df.sample(5)
# LabelEncoder assigns codes in sorted order of the labels:
# 0 = Business Software, 1 = Machine Learning

Unnamed: 0,pageid,title,category,text,categoryid
1626,21641559,Workspace.com,Business Software,this article contains content that is written ...,0
1386,5085849,Tycoon City: New York,Business Software,tycoon city new yorkdeveloper s deep red games...,0
760,22999330,Markov switching multifractal,Machine Learning,this article provides insufficient context for...,1
534,9517150,Shogun (toolbox),Machine Learning,this article includes a list of references rel...,1
2078,28846270,Mobile business intelligence,Business Software,mobile business intelligence mobile bi or mobi...,0
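
`LabelEncoder` sorts the distinct labels and numbers them from zero, which is why Business Software receives 0 and Machine Learning receives 1. A plain-Python sketch of the same mapping follows; `encode_labels` is a hypothetical helper shown only to make the rule explicit.

```python
def encode_labels(labels):
    """Map each distinct label to its position in sorted order."""
    classes = sorted(set(labels))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[label] for label in labels], mapping


codes, mapping = encode_labels(
    ["Machine Learning", "Business Software", "Machine Learning"]
)
```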


#### Split `total_df` into a Page table and a Category table, and generate a join table mapping each page ID to its category

In [107]:
PAGE_df = total_df[['pageid', 'title', 'text']]
PAGE_df.sample(5)

Unnamed: 0,pageid,title,text
1309,1488410,MassBalance,this article is an orphan as no other articles...
1077,27656596,Money (software),not to be confused with microsoft money this a...
2364,18530544,SportsML-G2,sportsml g is an xml news exchange standard of...
1529,1588264,SugarCRM,sugarcrmtypeprivateindustrycrm softwarefounded...
1505,38134506,TradeCard,tradecard inc connect transact profit former t...


In [108]:
PAGE_df.shape

(2449, 3)

In [109]:
PAGE_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2449 entries, 0 to 2448
Data columns (total 3 columns):
pageid    2449 non-null int64
title     2449 non-null object
text      2449 non-null object
dtypes: int64(1), object(2)
memory usage: 76.5+ KB


In [111]:
PAGE_df.to_pickle('./data/PAGE_df.p')

In [112]:
CATEGORY_df = total_df[['category', 'categoryid']]
CATEGORY_df.sample(5)

Unnamed: 0,category,categoryid
1789,Business Software,0
425,Machine Learning,1
347,Machine Learning,1
1517,Business Software,0
1158,Business Software,0


In [114]:
# Reassign instead of using inplace=True: CATEGORY_df is a slice of total_df,
# and modifying it in place raises pandas' SettingWithCopyWarning.
CATEGORY_df = CATEGORY_df.drop_duplicates()



In [115]:
CATEGORY_df.shape

(2, 2)

In [116]:
CATEGORY_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 899
Data columns (total 2 columns):
category      2 non-null object
categoryid    2 non-null int64
dtypes: int64(1), object(1)
memory usage: 48.0+ bytes


In [118]:
CATEGORY_df.to_pickle('./data/CATEGORY_df.p')

In [120]:
CATEGORY_PAGE_df = total_df[['pageid', 'category']]
CATEGORY_PAGE_df.shape

(2449, 2)

In [121]:
CATEGORY_PAGE_df.sample(5)

Unnamed: 0,pageid,category
858,52469,Machine Learning
1227,4755375,Business Software
1356,20327090,Business Software
2196,50315118,Business Software
2154,2504464,Business Software


In [122]:
CATEGORY_PAGE_df.to_pickle('./data/CATEGORY_PAGE_df.p')
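
The three pickled DataFrames correspond to the `page` and `category` tables named at the top of this notebook, plus a join table. The sketch below shows one possible schema. The join table name `category_page`, the column types, and the constraints are assumptions, and `sqlite3` stands in for PostgreSQL here only so the example runs without a database server.

```python
import sqlite3

SCHEMA = """
CREATE TABLE category (
    categoryid INTEGER PRIMARY KEY,
    category   TEXT NOT NULL UNIQUE
);
CREATE TABLE page (
    pageid INTEGER PRIMARY KEY,
    title  TEXT NOT NULL,
    text   TEXT
);
CREATE TABLE category_page (
    pageid     INTEGER NOT NULL REFERENCES page (pageid),
    categoryid INTEGER NOT NULL REFERENCES category (categoryid),
    PRIMARY KEY (pageid, categoryid)
);
"""

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real database
conn.executescript(SCHEMA)
conn.execute("INSERT INTO category VALUES (0, 'Business Software')")
conn.execute("INSERT INTO category VALUES (1, 'Machine Learning')")
conn.execute("INSERT INTO page VALUES (233488, 'Machine learning', '...')")
conn.execute("INSERT INTO category_page VALUES (233488, 1)")

# Recover each page's category name through the join table.
rows = conn.execute("""
    SELECT p.title, c.category
    FROM page AS p
    JOIN category_page AS cp ON cp.pageid = p.pageid
    JOIN category AS c ON c.categoryid = cp.categoryid
""").fetchall()
```

With `pageid` as a primary key, `PAGE_df` would need one row per page before loading, and `CATEGORY_PAGE_df` would need its category names mapped to `categoryid` values.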