## Prerequisites

For this exercise, we will run Solr in single-node mode.  
There are several ways to run Solr, including a [Docker image](https://hub.docker.com/_/solr), but here we use the binary distribution from the [Solr downloads page](https://solr.apache.org/downloads.html).  
I assume you are using Linux or macOS. On Windows, every command works unmodified except the one used for indexing documents.

- Download the Solr distribution archive and extract it to a directory of your choosing  
    - `curl -O <download-link> && tar -xzf solr-{version}.tgz`  
- Change into the extracted directory  
    - `cd solr-{version}`  
- Launch Solr in single-node mode, keeping the process in the foreground  
    - `bin/solr start -f`  
- In another terminal window or tab, create the `films` core  
    - `bin/solr create -c films`  
    - Note that no configset is specified for the `films` core here, so Solr creates it with the `_default` configuration files.
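Before moving on, it can help to confirm that the new core is reachable. The sketch below is one possible check, assuming the core exposes Solr's ping handler at `/admin/ping` and that Solr listens on the default `localhost:8983`. The helper names `ping_url` and `core_is_up` are made up for this example, and it uses only the standard library so that it runs outside the notebook too.

```python
import json
import urllib.request


def ping_url(host="http://localhost:8983/solr", core="films"):
    """Build the URL of the core's ping handler (assumed to be /admin/ping)."""
    return f"{host}/{core}/admin/ping?wt=json"


def core_is_up(host="http://localhost:8983/solr", core="films", timeout=5):
    """Return True if the core answers the ping request with status OK."""
    try:
        with urllib.request.urlopen(ping_url(host, core), timeout=timeout) as resp:
            return json.load(resp).get("status") == "OK"
    except (OSError, ValueError):
        # OSError covers connection failures and HTTP errors,
        # ValueError covers replies that are not valid JSON.
        return False
```

Calling `core_is_up()` after `bin/solr create -c films` should return `True`; a `False` usually means Solr is not running or the core name is different.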

In [18]:
import simplejson as json
import requests

host = 'http://localhost:8983/solr'
core = 'films'
search_url = host + '/' + core + '/select?q='
headers = {
    'Content-type':'application/json'
}

def search_query(query):
    query = requests.utils.quote(query)
    req = requests.get(search_url + query, headers=headers)
    if req.status_code == 200:
        result = json.loads(req.text)
        print(f"Matching documents count {result['response']['numFound']}")
        print(json.dumps(result['response']['docs'], indent=1))
    else:
        print(req.status_code, req.reason)

In [19]:
# set up custom schema settings
schema = {
    "add-field": 
    [
        {
            "name":"name", 
            "type":"text_general", 
            "multiValued":False, 
            "stored":True
        },
        {
            "name":"genre", 
            "type":"text_general", 
            "multiValued":True, 
            "stored":True
        }
    ]
}

r = requests.post(host+"/"+core+'/schema', json=schema)
if r.status_code == 200:
    print(r.text)
else:
    print(r.status_code, r.reason)

{
  "responseHeader":{
    "status":0,
    "QTime":389}}



In [20]:
# set up a catch-all field: a copy field that copies the data from every field into _text_
schema = {
    "add-copy-field": 
    {
        "source":"*",
        "dest":"_text_"
    }
}
r = requests.post(host+"/"+core+'/schema', json=schema)
if r.status_code == 200:
    print(r.text)
else:
    print(r.status_code, r.reason)

{
  "responseHeader":{
    "status":0,
    "QTime":275}}



## Index sample films data after schema definition
- `bin/post -c films example/films/films.json` for Linux/macOS users
- `java -jar -Dc=films -Dauto example\exampledocs\post.jar example\films\*.json` for Windows users
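As a platform-independent alternative, the same file can be sent to the core's `/update` handler over HTTP. This is only a sketch under a few assumptions: Solr runs on the default `localhost:8983`, `films.json` holds a JSON array of documents, and the helper names `build_update_request` and `index_file` are invented for this example. It sticks to the standard library so that it is self-contained.

```python
import json
import urllib.parse
import urllib.request


def build_update_request(docs, host="http://localhost:8983/solr", core="films", commit=True):
    """Build a POST request that sends a list of documents to the /update handler."""
    params = {"wt": "json"}
    if commit:
        # Commit right away so the documents become searchable.
        params["commit"] = "true"
    url = f"{host}/{core}/update?{urllib.parse.urlencode(params)}"
    return urllib.request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_file(path, **kwargs):
    """Read a JSON array of documents from `path` and index it."""
    with open(path, encoding="utf-8") as fh:
        docs = json.load(fh)
    with urllib.request.urlopen(build_update_request(docs, **kwargs)) as resp:
        return json.load(resp)
```

From the Solr directory, `index_file("example/films/films.json")` would then play the role of the `bin/post` command.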

In [22]:
# search for films that mention "Crime Fiction" (no field is named, so the default search field is used)
search_query("Crime Fiction")

Matching documents count 284
[
 {
  "id": "/en/anamorph",
  "genre": [
   "Psychological thriller",
   "Crime Fiction",
   "Thriller",
   "Mystery",
   "Crime Thriller",
   "Suspense"
  ],
  "directed_by": [
   "H.S. Miller"
  ],
  "name": "Anamorph",
  "_version_": 1695550362411335681
 },
 {
  "id": "/en/blood_work",
  "directed_by": [
   "Clint Eastwood"
  ],
  "initial_release_date": [
   "2002-08-09T00:00:00Z"
  ],
  "name": "Blood Work",
  "genre": [
   "Mystery",
   "Crime Thriller",
   "Thriller",
   "Suspense",
   "Crime Fiction",
   "Detective fiction",
   "Drama"
  ],
  "_version_": 1695550362494173186
 },
 {
  "id": "/en/brigham_city_2001",
  "directed_by": [
   "Richard Dutcher"
  ],
  "name": "Brigham City",
  "genre": [
   "Mystery",
   "Indie film",
   "Crime Fiction",
   "Thriller",
   "Crime Thriller",
   "Drama"
  ],
  "_version_": 1695550362510950401
 },
 {
  "id": "/en/brother",
  "directed_by": [
   "Takeshi Kitano"
  ],
  "name": "Brother",
  "genre": [
   "Thrill

In [31]:
# apply faceting
def basic_facet(facet_field, facet_mincount=20):
    url = search_url + '*:*&rows=0&facet=on&facet.field={}&facet.mincount={}&wt=json'.format(facet_field, facet_mincount)
    req = requests.get(url, headers=headers)
    if req.status_code == 200:
        result = json.loads(req.text)
        print(f"Matching documents count {result['response']['numFound']}")
        print("\nFacet counts\n")
        print(json.dumps(result['facet_counts']['facet_fields'], indent=1))
    else:
        print(req.status_code, req.reason)

basic_facet(facet_field='genre', facet_mincount=50)

Matching documents count 1100

Facet counts

{
 "genre": [
  "film",
  793,
  "drama",
  569,
  "comedy",
  417,
  "romance",
  270,
  "thriller",
  266,
  "fiction",
  263,
  "action",
  208,
  "crime",
  191,
  "cinema",
  184,
  "adventure",
  167,
  "world",
  167,
  "indie",
  144,
  "horror",
  122,
  "family",
  116,
  "musical",
  93,
  "romantic",
  93,
  "fantasy",
  87,
  "science",
  82,
  "mystery",
  78,
  "biographical",
  74,
  "documentary",
  73,
  "animation",
  60,
  "sports",
  58,
  "teen",
  58,
  "historical",
  53,
  "of",
  51
 ]
}
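The facet output above is a flat list that alternates between a value and its count, which is awkward to look things up in. A small helper can pair them up; `facet_pairs` is a name invented for this sketch, and it assumes the list always has the `[value, count, value, count, ...]` shape shown above.

```python
def facet_pairs(flat_counts):
    """Turn [value, count, value, count, ...] into a {value: count} dict."""
    return dict(zip(flat_counts[::2], flat_counts[1::2]))


# First entries of the genre facet shown above
sample = ["film", 793, "drama", 569, "comedy", 417]
print(facet_pairs(sample))  # → {'film': 793, 'drama': 569, 'comedy': 417}
```

Inside `basic_facet`, this could be applied as `facet_pairs(result['facet_counts']['facet_fields']['genre'])`.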


In [52]:
# range faceting, used to partition numeric and date fields into ranges
def advanced_faceting(facet_condition, out_var):
    url = search_url + '*:*&rows=0&facet=on&{}&wt=json'.format(facet_condition)
    req = requests.get(url, headers=headers)
    if req.status_code == 200:
        result = json.loads(req.text)
        print(f"Matching documents count {result['response']['numFound']}")
        return result['facet_counts'][out_var] 
    else:
        print(req.status_code, req.reason)
    return None
# Request all films and group them by release year, from 20 years ago
# (the earliest film release date is 2000) to 10 years ago (2010).
facet_condition = 'facet.range=initial_release_date&facet.range.start=NOW-20YEAR&facet.range.end=NOW-10YEAR&facet.range.gap={}YEAR'.format(requests.utils.quote("+1"))
facets = advanced_faceting(facet_condition, 'facet_ranges')
print(json.dumps(facets, indent=1))


Matching documents count 1100
{
 "initial_release_date": {
  "counts": [
   "2001-03-29T09:43:09.582Z",
   103,
   "2002-03-29T09:43:09.582Z",
   114,
   "2003-03-29T09:43:09.582Z",
   134,
   "2004-03-29T09:43:09.582Z",
   157,
   "2005-03-29T09:43:09.582Z",
   197,
   "2006-03-29T09:43:09.582Z",
   127,
   "2007-03-29T09:43:09.582Z",
   34,
   "2008-03-29T09:43:09.582Z",
   8,
   "2009-03-29T09:43:09.582Z",
   5,
   "2010-03-29T09:43:09.582Z",
   1
  ],
  "gap": "+1YEAR",
  "start": "2001-03-29T09:43:09.582Z",
  "end": "2011-03-29T09:43:09.582Z"
 }
}


In [65]:
# Pivot facets, also known as "decision trees", nest two or more fields to count every combination of their values
facet_condition = 'facet.pivot=genre,directed_by'  # advanced_faceting already adds facet=on
facets = advanced_faceting(facet_condition, 'facet_pivot')
if facets is not None:
    print(json.dumps(facets['genre,directed_by'][9], indent=1))

Matching documents count 1100
{
 "field": "genre",
 "value": "adventure",
 "count": 167,
 "pivot": [
  {
   "field": "directed_by",
   "value": "david",
   "count": 10
  },
  {
   "field": "directed_by",
   "value": "john",
   "count": 9
  },
  {
   "field": "directed_by",
   "value": "tim",
   "count": 5
  },
  {
   "field": "directed_by",
   "value": "brian",
   "count": 4
  },
  {
   "field": "directed_by",
   "value": "kevin",
   "count": 4
  },
  {
   "field": "directed_by",
   "value": "paul",
   "count": 4
  },
  {
   "field": "directed_by",
   "value": "robert",
   "count": 4
  },
  {
   "field": "directed_by",
   "value": "scott",
   "count": 4
  },
  {
   "field": "directed_by",
   "value": "bob",
   "count": 3
  },
  {
   "field": "directed_by",
   "value": "boll",
   "count": 3
  },
  {
   "field": "directed_by",
   "value": "eric",
   "count": 3
  },
  {
   "field": "directed_by",
   "value": "kitamura",
   "count": 3
  },
  {
   "field": "directed_by",
   "value": "molina

## Updating Data  
Solr uses a `uniqueKey` field called `id` to identify indexed documents. Whenever you `POST` a document whose `uniqueKey` value matches that of an existing document, Solr automatically replaces the existing document. In other words, to update a document, `POST` the new version with the same `id` as the document it should replace.
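Because the whole document is replaced, the re-posted version has to carry every field that should survive, not just the changed ones. The sketch below illustrates one way to do that from Python. The helper names `replacement_doc` and `post_doc` are invented for this example, the default `localhost:8983` address is assumed, and dropping `_version_` reflects my understanding that re-sending it makes Solr check the stored version before accepting the update.

```python
import json
import urllib.request


def replacement_doc(existing, changes):
    """Return the full document to re-post: existing fields overlaid with changes."""
    # Leave out Solr's internal _version_ so the re-post is a plain overwrite.
    doc = {key: value for key, value in existing.items() if key != "_version_"}
    doc.update(changes)
    return doc


def post_doc(doc, host="http://localhost:8983/solr", core="films"):
    """POST one document to /update; an existing document with the same id is replaced."""
    url = f"{host}/{core}/update?commit=true&wt=json"
    req = urllib.request.Request(
        url,
        data=json.dumps([doc]).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For instance, given a document `existing` returned by an earlier search, `post_doc(replacement_doc(existing, {"genre": ["Thriller"]}))` would overwrite it with a shortened genre list (the new value here is purely illustrative).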

## Deleting Data

Execute the following command to delete a specific document:

`bin/post -c films -d "<delete><id>/en/45_2006</id></delete>"`

To delete all documents, use a delete-by-query command that matches everything:

`bin/post -c films -d "<delete><query>*:*</query></delete>"`

You can also narrow the query so that only the documents matching it are deleted:

`bin/post -c films -d "<delete><query>genre:crime</query></delete>"`
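The same deletions can be issued from Python by posting JSON delete commands to the `/update` handler. This is a minimal sketch assuming the default `localhost:8983` address; `delete_by_id`, `delete_by_query` and `send_update` are names made up for this example, and only the standard library is used.

```python
import json
import urllib.request


def delete_by_id(doc_id):
    """Build the JSON command that deletes a single document by its id."""
    return {"delete": {"id": doc_id}}


def delete_by_query(query):
    """Build the JSON command that deletes every document matching the query."""
    return {"delete": {"query": query}}


def send_update(command, host="http://localhost:8983/solr", core="films"):
    """POST an update command to the core and commit it immediately."""
    url = f"{host}/{core}/update?commit=true&wt=json"
    req = urllib.request.Request(
        url,
        data=json.dumps(command).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

`send_update(delete_by_id("/en/45_2006"))` mirrors the first command above, and `send_update(delete_by_query("genre:crime"))` mirrors the last one. Be careful with `delete_by_query("*:*")`, which empties the core.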