## Inverted Index from Scratch

In [1]:
%load_ext sql
import os
connection_string = os.environ["DATABASE_URL"]
%sql postgresql://$connection_string

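Strings, arrays, and rows: `string_to_array` splits a string into an array, and `unnest` expands an array into one row per element.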

In [2]:
%%sql
SELECT string_to_array('hello world', ' ');

 * postgresql://postgres:***@localhost/pg4e
1 rows affected.


string_to_array
"['hello', 'world']"


In [3]:
%%sql 
SELECT unnest(string_to_array('hello world', ' '));

 * postgresql://postgres:***@localhost/pg4e
2 rows affected.


unnest
hello
world


In [4]:
%%sql
DROP TABLE IF EXISTS docs CASCADE;
CREATE TABLE docs(id SERIAL, doc TEXT, PRIMARY KEY(id));
INSERT INTO docs(doc) VALUES
  ('This is SQL and Python and other fun teaching stuff'),
  ('More people should learn SQL from UMSI'),
  ('UMSI also teaches Python and also SQL');
SELECT * FROM docs;


 * postgresql://postgres:***@localhost/pg4e
Done.
Done.
3 rows affected.
3 rows affected.


id,doc
1,This is SQL and Python and other fun teaching stuff
2,More people should learn SQL from UMSI
3,UMSI also teaches Python and also SQL


Break the `doc` column into one row per word, keeping the document's primary key next to each word

In [5]:
%%sql
SELECT id, s.keyword AS keyword
FROM docs AS D, unnest(string_to_array(D.doc, ' ')) s(keyword)
ORDER BY id;

 * postgresql://postgres:***@localhost/pg4e
24 rows affected.


id,keyword
1,This
1,is
1,SQL
1,and
1,Python
1,and
1,other
1,fun
1,teaching
1,stuff
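
For readers who think more easily in Python than in set-returning SQL functions, the query above can be approximated with a nested loop. This is only an illustrative sketch: `DOCS` and `explode` are hypothetical names standing in for the `docs` table and the `unnest(string_to_array(...))` call, and `str.split(" ")` plays the role of `string_to_array(D.doc, ' ')`.

```python
# Hypothetical in-memory stand-in for the docs table: id -> doc
DOCS = {
    1: "This is SQL and Python and other fun teaching stuff",
    2: "More people should learn SQL from UMSI",
    3: "UMSI also teaches Python and also SQL",
}

def explode(docs):
    """Return one (id, keyword) pair per word, ordered by id."""
    return [
        (doc_id, word)
        for doc_id, text in sorted(docs.items())
        for word in text.split(" ")
    ]

rows = explode(DOCS)
print(len(rows))   # 24, the same row count the SQL query reports
print(rows[:3])    # [(1, 'This'), (1, 'is'), (1, 'SQL')]
```

Repeated words such as `and` in document 1 still appear twice at this stage; duplicates are only removed once `DISTINCT` is applied.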


In [6]:
%%sql
DROP TABLE IF EXISTS docs_gin;
CREATE TABLE docs_gin(
  keyword TEXT, 
  doc_id INTEGER REFERENCES docs(id) ON DELETE CASCADE
);

 * postgresql://postgres:***@localhost/pg4e
Done.
Done.


[]

In [7]:
%%sql
INSERT INTO docs_gin(doc_id, keyword)
SELECT DISTINCT id, s.keyword AS keyword
FROM docs AS D, unnest(string_to_array(D.doc, ' ')) s(keyword)
ORDER BY id;

SELECT * FROM docs_gin;

 * postgresql://postgres:***@localhost/pg4e
22 rows affected.
22 rows affected.


keyword,doc_id
and,1
fun,1
is,1
other,1
Python,1
SQL,1
stuff,1
teaching,1
This,1
from,2
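
The `docs_gin` table is a relational encoding of a mapping from keyword to document ids. As a rough sketch of the same idea outside the database, the mapping can be held in a dictionary of sets. The names `DOCS` and `build_index` are assumptions made for this example, not part of the notebook.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the docs table: id -> doc
DOCS = {
    1: "This is SQL and Python and other fun teaching stuff",
    2: "More people should learn SQL from UMSI",
    3: "UMSI also teaches Python and also SQL",
}

def build_index(docs):
    """Map each keyword to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split(" "):
            # Adding to a set drops repeats, which mirrors SELECT DISTINCT
            index[word].add(doc_id)
    return index

index = build_index(DOCS)
print(sum(len(ids) for ids in index.values()))  # 22 distinct (keyword, doc_id) pairs
print(sorted(index["UMSI"]))                    # [2, 3]
```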


Find all the distinct documents that match a keyword

In [8]:
%%sql 
SELECT DISTINCT doc FROM docs AS D
JOIN docs_gin AS G ON D.id = G.doc_id
WHERE G.keyword = 'UMSI';

 * postgresql://postgres:***@localhost/pg4e
2 rows affected.


doc
More people should learn SQL from UMSI
UMSI also teaches Python and also SQL


We can also match on more than one keyword

In [9]:
%%sql
SELECT DISTINCT doc FROM docs AS D
JOIN docs_gin AS G ON D.id = G.doc_id
WHERE G.keyword IN ('fun', 'people');

 * postgresql://postgres:***@localhost/pg4e
2 rows affected.


doc
More people should learn SQL from UMSI
This is SQL and Python and other fun teaching stuff


We can also handle a phrase by splitting it into words and matching any of them

In [10]:
%%sql
SELECT DISTINCT doc FROM docs AS D
JOIN docs_gin AS G ON D.id = G.doc_id
WHERE G.keyword = ANY(string_to_array('I want to learn', ' '));

 * postgresql://postgres:***@localhost/pg4e
1 rows affected.


doc
More people should learn SQL from UMSI
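
The `= ANY(...)` predicate returns a document as soon as any single word of the phrase is indexed for it; word order and adjacency are ignored. A minimal Python sketch of that lookup, assuming a hypothetical dictionary-of-sets index and a hypothetical `search_any` helper, could look like this:

```python
# Hypothetical in-memory stand-in for the docs table: id -> doc
DOCS = {
    1: "This is SQL and Python and other fun teaching stuff",
    2: "More people should learn SQL from UMSI",
    3: "UMSI also teaches Python and also SQL",
}

# keyword -> set of document ids, the in-memory analogue of docs_gin
INDEX = {}
for doc_id, text in DOCS.items():
    for word in text.split(" "):
        INDEX.setdefault(word, set()).add(doc_id)

def search_any(index, docs, phrase):
    """Return the documents containing at least one word of the phrase."""
    doc_ids = set()
    for word in phrase.split(" "):
        doc_ids |= index.get(word, set())  # union, like keyword = ANY(...)
    return [docs[i] for i in sorted(doc_ids)]

print(search_any(INDEX, DOCS, "I want to learn"))
# ['More people should learn SQL from UMSI']
```

Matching is case-sensitive here, just as in the SQL version, so `learn` matches but `Learn` would not.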


The following query matches two documents only because of the stop word `and`, which is not desirable

In [11]:
%%sql
SELECT DISTINCT doc FROM docs AS D
JOIN docs_gin AS G ON D.id = G.doc_id
WHERE G.keyword = ANY(string_to_array('Search for lemons and neons', ' '));

 * postgresql://postgres:***@localhost/pg4e
2 rows affected.


doc
UMSI also teaches Python and also SQL
This is SQL and Python and other fun teaching stuff
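
One common remedy is to drop stop words both when building the index and when processing the query. The sketch below illustrates the idea in Python under stated assumptions: `STOP_WORDS` is a deliberately tiny, made-up list, and `build_index` and `search_any` are hypothetical helpers rather than anything defined in this notebook.

```python
# Hypothetical in-memory stand-in for the docs table: id -> doc
DOCS = {
    1: "This is SQL and Python and other fun teaching stuff",
    2: "More people should learn SQL from UMSI",
    3: "UMSI also teaches Python and also SQL",
}

# Assumed stop-word list for illustration only; real lists are much longer
STOP_WORDS = {"and", "is", "also", "for", "from"}

def build_index(docs, stop_words):
    """Map keyword -> set of document ids, skipping stop words."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split(" "):
            if word.lower() in stop_words:
                continue  # stop words never enter the index
            index.setdefault(word, set()).add(doc_id)
    return index

def search_any(index, phrase, stop_words):
    """Return sorted ids of documents matching any non-stop word."""
    hits = set()
    for word in phrase.split(" "):
        if word.lower() not in stop_words:
            hits |= index.get(word, set())
    return sorted(hits)

index = build_index(DOCS, STOP_WORDS)
print(search_any(index, "Search for lemons and neons", STOP_WORDS))  # []
print(search_any(index, "fun people", STOP_WORDS))                   # [1, 2]
```

With `and` filtered out, the phrase no longer produces spurious matches, while queries on meaningful keywords behave as before.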
