# Notebook 01: Data Exploration & Lifestyle Classification

**Project**: Lifestyle-Based COG Stratification

**Goal**: Assess `ncbi_env` coverage, build a genome-level lifestyle classifier, and identify target species for COG analysis.

**Data Sources**:
- `kbase_ke_pangenome.ncbi_env` — NCBI environment metadata (4.1M rows, EAV format)
- `kbase_ke_pangenome.genome` — Genome metadata (293K rows)
- `kbase_ke_pangenome.pangenome` — Per-species pangenome stats (27K rows)
- `kbase_ke_pangenome.gtdb_species_clade` — Taxonomy (27K rows)

**Output**: `../data/species_lifestyle_classification.csv`

In [1]:
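# get_spark_session() is not imported here; it is assumed to be provided by the notebook environment
spark = get_spark_session()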

## 1. Table Row Counts

Verify expected table sizes before analysis.

In [2]:
tables = ['ncbi_env', 'genome', 'sample', 'pangenome', 'gtdb_species_clade']
for t in tables:
    cnt = spark.sql(f"SELECT COUNT(*) as cnt FROM kbase_ke_pangenome.{t}").collect()[0]['cnt']
    print(f"{t}: {cnt:,} rows")

ncbi_env: 4,124,801 rows
genome: 293,059 rows
sample: 293,059 rows
pangenome: 27,702 rows
gtdb_species_clade: 27,690 rows
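
The check above is done by eye. A minimal sketch of making it explicit follows, comparing each observed count with the rough size given under Data Sources. The names `EXPECTED` and `within_tolerance` and the 5% tolerance are illustrative assumptions, not part of the notebook; `sample` is left out because Data Sources lists no expected size for it.

```python
# Approximate sizes taken from the Data Sources list (hypothetical lookup table).
EXPECTED = {
    "ncbi_env": 4_100_000,
    "genome": 293_000,
    "pangenome": 27_000,
    "gtdb_species_clade": 27_000,
}

# Counts printed by the cell above.
observed = {
    "ncbi_env": 4_124_801,
    "genome": 293_059,
    "pangenome": 27_702,
    "gtdb_species_clade": 27_690,
}


def within_tolerance(observed_count, expected_count, rel_tol=0.05):
    """Return True when the observed count is within rel_tol of the expected size."""
    return abs(observed_count - expected_count) <= rel_tol * expected_count


# One boolean per table; a False here would be worth investigating before analysis.
flags = {table: within_tolerance(observed[table], EXPECTED[table]) for table in EXPECTED}
```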


## 2. Explore ncbi_env Structure

The `ncbi_env` table is in EAV (entity-attribute-value) format: each row is one (`accession`, `harmonized_name`, `content`) triple, where `accession` is the BioSample ID.
We need to find which `harmonized_name` values are most populated and useful for lifestyle classification.
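
To make the EAV layout concrete, here is a small plain-Python sketch that pivots a few triples into one record per BioSample. The accessions `SAMN_A` and `SAMN_B` and the helper `pivot_eav` are made up for illustration; the real table is queried through Spark.

```python
from collections import defaultdict

# Toy EAV rows: (accession, harmonized_name, content), mirroring the ncbi_env columns.
rows = [
    ("SAMN_A", "host", "Homo sapiens"),
    ("SAMN_A", "isolation_source", "blood"),
    ("SAMN_B", "env_broad_scale", "marine biome [ENVO:00000447]"),
    ("SAMN_B", "isolation_source", "hypoxic seawater"),
]


def pivot_eav(rows, attributes):
    """Return {accession: {attribute: content or None}} for the requested attributes."""
    wide = defaultdict(lambda: dict.fromkeys(attributes))
    for accession, name, content in rows:
        if name in attributes:
            # If an attribute repeats for one accession, the last value wins here.
            wide[accession][name] = content
    return dict(wide)


wide = pivot_eav(rows, ["host", "isolation_source", "env_broad_scale"])
```

Attributes that a BioSample never reports stay `None`, which is why sparsely populated attributes matter for coverage.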

In [3]:
# What harmonized_name values exist and how many biosamples have them?
attr_counts = spark.sql("""
    SELECT harmonized_name, COUNT(DISTINCT accession) as n_biosamples
    FROM kbase_ke_pangenome.ncbi_env
    GROUP BY harmonized_name
    ORDER BY n_biosamples DESC
    LIMIT 30
""")
attr_counts.show(30, truncate=False)

+--------------------+------------+
|harmonized_name     |n_biosamples|
+--------------------+------------+
|collection_date     |272680      |
|geo_loc_name        |271943      |
|isolation_source    |245344      |
|strain              |204424      |
|host                |170542      |
|NULL                |159029      |
|lat_lon             |137438      |
|sample_type         |111887      |
|env_broad_scale     |88251       |
|env_local_scale     |79794       |
|env_medium          |79613       |
|collected_by        |79233       |
|sample_name         |78793       |
|isolate             |64406       |
|host_disease        |52701       |
|metagenome_source   |50133       |
|project_name        |41136       |
|ref_biomaterial     |36478       |
|investigation_type  |31845       |
|derived_from        |30283       |
|host_health_state   |29616       |
|serovar             |28635       |
|isol_growth_condt   |28027       |
|num_replicons       |27607       |
|depth               |20088 

In [4]:
# Sample rows to understand the data format
spark.sql("""
    SELECT accession, harmonized_name, content
    FROM kbase_ke_pangenome.ncbi_env
    WHERE harmonized_name IN ('host', 'isolation_source', 'env_broad_scale', 'env_local_scale', 'env_medium')
    LIMIT 20
""").show(20, truncate=False)

+------------+----------------+----------------------------+
|accession   |harmonized_name |content                     |
+------------+----------------+----------------------------+
|SAMN14914897|env_broad_scale |marine biome [ENVO:00000447]|
|SAMN14914897|isolation_source|hypoxic seawater            |
|SAMN14914898|env_broad_scale |marine biome [ENVO:00000447]|
|SAMN14914898|isolation_source|hypoxic seawater            |
|SAMN14914900|env_broad_scale |marine biome [ENVO:00000447]|
|SAMN14914900|isolation_source|hypoxic seawater            |
|SAMN14914901|env_broad_scale |marine biome [ENVO:00000447]|
|SAMN14914901|isolation_source|hypoxic seawater            |
|SAMN14914903|env_broad_scale |marine biome [ENVO:00000447]|
|SAMN14914903|isolation_source|hypoxic seawater            |
|SAMN14914904|env_broad_scale |marine biome [ENVO:00000447]|
|SAMN14914904|isolation_source|hypoxic seawater            |
|SAMN14914905|env_broad_scale |marine biome [ENVO:00000447]|
|SAMN14914905|isolation_

## 3. Assess Genome-to-ncbi_env Join Coverage

How many of the 293K genomes have `ncbi_biosample_id` values that match `ncbi_env.accession`?

In [5]:
# Check how many genomes have non-null biosample IDs
spark.sql("""
    SELECT
        COUNT(*) as total_genomes,
        SUM(CASE WHEN ncbi_biosample_id IS NOT NULL AND ncbi_biosample_id != '' THEN 1 ELSE 0 END) as has_biosample,
        SUM(CASE WHEN ncbi_biosample_id IS NOT NULL AND ncbi_biosample_id != '' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) as pct_with_biosample
    FROM kbase_ke_pangenome.genome
""").show()

+-------------+-------------+------------------+
|total_genomes|has_biosample|pct_with_biosample|
+-------------+-------------+------------------+
|       293059|       293059|100.00000000000000|
+-------------+-------------+------------------+



In [6]:
# Check how many genomes actually join to ncbi_env
spark.sql("""
    SELECT
        COUNT(DISTINCT g.genome_id) as genomes_with_env
    FROM kbase_ke_pangenome.genome g
    JOIN kbase_ke_pangenome.ncbi_env ne
        ON g.ncbi_biosample_id = ne.accession
""").show()

+----------------+
|genomes_with_env|
+----------------+
|          293050|
+----------------+
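
The join matches 293,050 of 293,059 genomes, so 9 genomes carry a BioSample ID with no `ncbi_env` rows. A hedged sketch of the anti-join logic that would list such genomes, in plain Python with made-up IDs (`join_coverage` is a hypothetical helper, not a notebook function):

```python
def join_coverage(genome_to_biosample, env_accessions):
    """Return (matched_count, unmatched_genome_ids) for a genome -> BioSample mapping."""
    unmatched = sorted(
        genome_id
        for genome_id, biosample in genome_to_biosample.items()
        if biosample not in env_accessions
    )
    return len(genome_to_biosample) - len(unmatched), unmatched


# Toy inputs: three genomes, two of which have environment metadata.
genomes = {"G1": "SAMN_1", "G2": "SAMN_2", "G3": "SAMN_3"}
env = {"SAMN_1", "SAMN_3"}

matched, unmatched = join_coverage(genomes, env)
```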



In [7]:
# How many genomes have lifestyle-relevant attributes specifically?
spark.sql("""
    SELECT
        ne.harmonized_name,
        COUNT(DISTINCT g.genome_id) as n_genomes
    FROM kbase_ke_pangenome.genome g
    JOIN kbase_ke_pangenome.ncbi_env ne
        ON g.ncbi_biosample_id = ne.accession
    WHERE ne.harmonized_name IN ('host', 'isolation_source', 'env_broad_scale', 'env_local_scale', 'env_medium')
    GROUP BY ne.harmonized_name
    ORDER BY n_genomes DESC
""").show()

+----------------+---------+
| harmonized_name|n_genomes|
+----------------+---------+
|isolation_source|   245717|
|            host|   170851|
| env_broad_scale|    88321|
| env_local_scale|    79863|
|      env_medium|    79682|
+----------------+---------+



## 4. Examine Content Values for Lifestyle Classification

For each lifestyle-relevant attribute, examine the most common content values.
This will help us build classification rules.
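
Free-text values in these attributes vary in case and include placeholder strings. The sketch below shows one way to normalize a value before counting or matching. The `SENTINELS` set is an assumption: it combines the placeholders filtered in the queries in this section with 'not available', 'not provided', and 'not known', which show up in the notebook outputs.

```python
import re

# Placeholder strings treated as missing (assumed list, see the note above).
SENTINELS = {
    "", "not applicable", "not collected", "missing", "na", "n/a", "none",
    "unknown", "not available", "not provided", "not known",
}


def clean_value(value):
    """Return a trimmed, lower-cased value, or None if it is missing or a placeholder."""
    if value is None:
        return None
    normalized = re.sub(r"\s+", " ", value).strip().lower()
    return None if normalized in SENTINELS else normalized
```

With this, `Pig` and `pig` collapse to one value. Synonyms such as `swine` or `Sus scrofa` would still need a separate mapping.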

In [8]:
# Top 'host' values
spark.sql("""
    SELECT content, COUNT(*) as cnt
    FROM kbase_ke_pangenome.ncbi_env
    WHERE harmonized_name = 'host'
        AND content IS NOT NULL
        AND content != ''
        AND LOWER(content) NOT IN ('not applicable', 'not collected', 'missing', 'na', 'n/a', 'none', 'unknown')
    GROUP BY content
    ORDER BY cnt DESC
    LIMIT 30
""").show(30, truncate=False)

+------------------------+------+
|content                 |cnt   |
+------------------------+------+
|Homo sapiens            |104610|
|Gallus gallus           |2310  |
|Bos taurus              |2210  |
|Sus scrofa              |1457  |
|Mus musculus            |1422  |
|pig                     |1382  |
|Sus scrofa domesticus   |1351  |
|not available           |1244  |
|sheep                   |854   |
|goat                    |833   |
|Pig                     |711   |
|cattle                  |707   |
|chicken                 |663   |
|bovine                  |619   |
|Corn                    |619   |
|environmental           |580   |
|Soybean                 |575   |
|Cicer arietinum         |559   |
|Canis lupus familiaris  |443   |
|swine                   |441   |
|Ircinia ramosa          |435   |
|Arabidopsis thaliana    |383   |
|Equus ferus caballus    |373   |
|Gallus gallus domesticus|354   |
|roe deer                |341   |
|dairy cattle            |338   |
|dog          

In [9]:
# Top 'isolation_source' values
spark.sql("""
    SELECT content, COUNT(*) as cnt
    FROM kbase_ke_pangenome.ncbi_env
    WHERE harmonized_name = 'isolation_source'
        AND content IS NOT NULL
        AND content != ''
        AND LOWER(content) NOT IN ('not applicable', 'not collected', 'missing', 'na', 'n/a', 'none', 'unknown')
    GROUP BY content
    ORDER BY cnt DESC
    LIMIT 30
""").show(30, truncate=False)

+----------------------------------------------------------------+-----+
|content                                                         |cnt  |
+----------------------------------------------------------------+-----+
|blood                                                           |12167|
|feces                                                           |11996|
|sputum                                                          |6846 |
|lake water                                                      |6278 |
|human gut                                                       |5605 |
|stool                                                           |4965 |
|soil                                                            |4376 |
|nasopharynx                                                     |4138 |
|urine                                                           |3947 |
|human feces                                                     |3922 |
|cattle rumen                                      

In [10]:
# Top 'env_broad_scale' values
spark.sql("""
    SELECT content, COUNT(*) as cnt
    FROM kbase_ke_pangenome.ncbi_env
    WHERE harmonized_name = 'env_broad_scale'
        AND content IS NOT NULL
        AND content != ''
        AND LOWER(content) NOT IN ('not applicable', 'not collected', 'missing', 'na', 'n/a', 'none', 'unknown')
    GROUP BY content
    ORDER BY cnt DESC
    LIMIT 30
""").show(30, truncate=False)

+-------------------------------------------+----+
|content                                    |cnt |
+-------------------------------------------+----+
|lentic water body                          |6776|
|human-associated habitat                   |4404|
|terrestrial biome                          |3921|
|Bos taurus                                 |3793|
|marine                                     |2331|
|marine biome [ENVO:00000447]               |1885|
|ENVO:00000428                              |1776|
|human gut                                  |1292|
|aquatic                                    |1286|
|marine pelagic biome                       |1271|
|human                                      |1237|
|bodily fluid material biome [ENVO:02000019]|1069|
|host-associated habitat                    |1023|
|terrestrial biome [ENVO:00000446]          |975 |
|not provided                               |953 |
|Pig digestive system                       |934 |
|feces                         

## 5. Build Lifestyle Classifier

Classification rules (refine after inspecting content values above):

**Host-associated**: Genome has a valid `host` attribute (not 'not applicable', etc.) OR `isolation_source` contains host-related keywords (blood, sputum, wound, stool, urine, clinical, patient, etc.)

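**Free-living**: Genome does not qualify as host-associated (no valid `host` attribute and no host keyword in `isolation_source`) AND `isolation_source` or `env_broad_scale` contains environmental keywords (soil, water, marine, freshwater, sediment, rhizosphere, etc.). `env_local_scale` and `env_medium` are extracted but not used by the rules.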

**Ambiguous/Unknown**: Everything else
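
The precedence of these rules can be sketched in plain Python. This is an illustrative variant, not the notebook's implementation: it uses shortened keyword lists and whole-word regex matching, whereas the Spark cell in this section matches substrings. All names here (`HOST_WORDS`, `ENV_WORDS`, `classify`) are hypothetical.

```python
import re

# Shortened keyword lists for illustration only.
HOST_WORDS = ["blood", "sputum", "stool", "urine", "feces", "patient"]
ENV_WORDS = ["soil", "water", "marine", "sediment", "sea", "lake"]
NA_VALUES = {"", "not applicable", "not collected", "missing", "na", "n/a", "none", "unknown"}


def _pattern(words):
    """Compile a case-insensitive regex that matches any of the words as a whole word."""
    return re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b", re.IGNORECASE)


HOST_RE = _pattern(HOST_WORDS)
ENV_RE = _pattern(ENV_WORDS)


def classify(host, isolation_source, env_broad_scale):
    """Apply the rules in order: host-associated first, then free-living, else ambiguous."""
    has_host = host is not None and host.strip().lower() not in NA_VALUES
    if has_host or (isolation_source and HOST_RE.search(isolation_source)):
        return "host_associated"
    if any(v and ENV_RE.search(v) for v in (isolation_source, env_broad_scale)):
        return "free_living"
    return "ambiguous"
```

The matching style is a trade-off. Whole-word matching keeps `sea` from firing on `disease`, but it also misses compounds such as `seawater`, which substring matching would catch.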

In [11]:
# Extract all lifestyle-relevant metadata per genome
# Pivot the EAV table into a wide format with one row per genome
# MAX() ignores NULLs; if an attribute has several values, it keeps the lexicographically largest
genome_env = spark.sql("""
    SELECT
        g.genome_id,
        g.gtdb_species_clade_id,
        MAX(CASE WHEN ne.harmonized_name = 'host' THEN ne.content END) as host,
        MAX(CASE WHEN ne.harmonized_name = 'isolation_source' THEN ne.content END) as isolation_source,
        MAX(CASE WHEN ne.harmonized_name = 'env_broad_scale' THEN ne.content END) as env_broad_scale,
        MAX(CASE WHEN ne.harmonized_name = 'env_local_scale' THEN ne.content END) as env_local_scale,
        MAX(CASE WHEN ne.harmonized_name = 'env_medium' THEN ne.content END) as env_medium
    FROM kbase_ke_pangenome.genome g
    JOIN kbase_ke_pangenome.ncbi_env ne
        ON g.ncbi_biosample_id = ne.accession
    WHERE ne.harmonized_name IN ('host', 'isolation_source', 'env_broad_scale', 'env_local_scale', 'env_medium')
    GROUP BY g.genome_id, g.gtdb_species_clade_id
""")

genome_env.cache()
print(f"Genomes with any lifestyle metadata: {genome_env.count():,}")
genome_env.show(10, truncate=50)

Genomes with any lifestyle metadata: 279,547
+------------------+--------------------------------------------------+--------------------+---------------------------------------------+--------------------------------------+--------------------------------------+--------------------------------------+
|         genome_id|                             gtdb_species_clade_id|                host|                             isolation_source|                       env_broad_scale|                       env_local_scale|                            env_medium|
+------------------+--------------------------------------------------+--------------------+---------------------------------------------+--------------------------------------+--------------------------------------+--------------------------------------+
|GB_GCA_000364545.1|    s__Fonsibacter_sp018882565--GB_GCA_018882565.1|                NULL|                                         NULL|not provided; submitted under MIGS 2.1|not provid

In [12]:
from pyspark.sql import functions as F

# Define classification rules
# NOTE: Refine these keyword lists after inspecting the content values above!

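# Sentinel values to exclude (compared against the lower-cased host value)
# NOTE: 'not available' appears among the top host values above but is not listed here,
# so it currently counts as a valid host.
na_values = ['not applicable', 'not collected', 'missing', 'na', 'n/a', 'none', 'unknown', '']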

# Host-related isolation source keywords (clinical/host samples)
host_keywords = ['blood', 'sputum', 'wound', 'stool', 'urine', 'clinical',
                 'patient', 'feces', 'abscess', 'throat', 'nasal', 'skin',
                 'lung', 'cerebrospinal', 'bile', 'tissue', 'biopsy',
                 'human', 'cattle', 'chicken', 'pig', 'mouse', 'bovine',
                 'swine', 'poultry', 'gut', 'intestin', 'fecal', 'rectal']

# Environmental isolation source keywords
env_keywords = ['soil', 'water', 'marine', 'freshwater', 'sediment', 'river',
                'lake', 'ocean', 'sea', 'rhizosphere', 'root', 'leaf',
                'air', 'hot spring', 'hydrothermal', 'mine', 'compost',
                'wastewater', 'activated sludge', 'biofilm', 'rock', 'sand']


def has_valid_host(host_col):
    """Return a boolean Column: true when host is non-null and not a sentinel value."""
    return (
        host_col.isNotNull()
        & (~F.lower(host_col).isin(na_values))
    )


def matches_keywords(col, keywords):
    """Return a boolean Column: true when col contains any keyword (case-insensitive).

    Matching is by substring, so 'sea' also matches 'disease' and 'pig' matches 'pigeon'.
    """
    lowered = F.lower(col)
    any_match = F.lit(False)
    for kw in keywords:
        any_match = any_match | lowered.contains(kw)
    # A NULL column counts as no match rather than NULL
    return F.when(col.isNull(), F.lit(False)).otherwise(any_match)


# Classify each genome
classified = genome_env.withColumn(
    'lifestyle',
    F.when(
        has_valid_host(F.col('host')) | matches_keywords(F.col('isolation_source'), host_keywords),
        F.lit('host_associated')
    ).when(
        matches_keywords(F.col('isolation_source'), env_keywords)
        | matches_keywords(F.col('env_broad_scale'), env_keywords),
        F.lit('free_living')
    ).otherwise(F.lit('ambiguous'))
)

# Summary
classified.groupBy('lifestyle').count().orderBy('count', ascending=False).show()

+---------------+------+
|      lifestyle| count|
+---------------+------+
|host_associated|180083|
|    free_living| 52145|
|      ambiguous| 47319|
+---------------+------+



In [13]:
# Spot-check: sample genomes from each category
for cat in ['host_associated', 'free_living', 'ambiguous']:
    print(f"\n=== {cat} (sample 5) ===")
    classified.filter(F.col('lifestyle') == cat).select(
        'genome_id', 'host', 'isolation_source', 'env_broad_scale'
    ).show(5, truncate=60)


=== host_associated (sample 5) ===
+------------------+--------------------+---------------------------------+----------------+
|         genome_id|                host|                 isolation_source| env_broad_scale|
+------------------+--------------------+---------------------------------+----------------+
|GB_GCA_000431755.1|                NULL|derived from human gut metagenome|            NULL|
|GB_GCA_000436235.1|                NULL|derived from human gut metagenome|            NULL|
|GB_GCA_001157725.1|        Homo sapiens|                        not known|            NULL|
|GB_GCA_001398295.1|        Homo sapiens|                         cervical|            NULL|
|GB_GCA_001422265.1|Arabidopsis thaliana|                             NULL|A. thaliana leaf|
+------------------+--------------------+---------------------------------+----------------+
only showing top 5 rows

=== free_living (sample 5) ===
+------------------+--------------+------------------------------------

## 6. Aggregate to Species Level

Assign each species a lifestyle based on majority vote of its constituent genomes.
Filter to species with:
- >= 10 genomes total (for a meaningful pangenome; applied later using `no_genomes` from the `pangenome` table)
- >= 70% of classified genomes agreeing on lifestyle (clear assignment)
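
The majority-vote rule can be illustrated on a single species in plain Python. This is a sketch of the logic only, and `assign_species_lifestyle` is a hypothetical helper; the notebook performs the same steps with Spark window functions.

```python
from collections import Counter


def assign_species_lifestyle(genome_labels, min_fraction=0.7):
    """Return (lifestyle, fraction) for the dominant label, or None without a clear majority.

    Ambiguous genomes are dropped before voting, so the fraction is relative to
    classified genomes only.
    """
    votes = Counter(label for label in genome_labels if label != "ambiguous")
    total = sum(votes.values())
    if total == 0:
        return None
    lifestyle, count = votes.most_common(1)[0]
    fraction = count / total
    return (lifestyle, fraction) if fraction >= min_fraction else None


# 8 host-associated and 2 free-living genomes; the 5 ambiguous ones do not vote.
labels = ["host_associated"] * 8 + ["free_living"] * 2 + ["ambiguous"] * 5
result = assign_species_lifestyle(labels)
```

Because ambiguous genomes are excluded from the denominator, a species can pass the 70% threshold on only a few classified genomes, even when most of its genomes are ambiguous.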

In [14]:
# Count lifestyle categories per species
species_lifestyle = classified.filter(
    F.col('lifestyle') != 'ambiguous'
).groupBy('gtdb_species_clade_id', 'lifestyle').count()

# Get total classified genomes per species
from pyspark.sql.window import Window

w = Window.partitionBy('gtdb_species_clade_id')
species_lifestyle = species_lifestyle.withColumn(
    'total_classified', F.sum('count').over(w)
).withColumn(
    'fraction', F.col('count') / F.col('total_classified')
)

species_lifestyle.show(20, truncate=False)

+--------------------------------------------------+-----------+-----+----------------+--------+
|gtdb_species_clade_id                             |lifestyle  |count|total_classified|fraction|
+--------------------------------------------------+-----------+-----+----------------+--------+
|s__0-14-0-80-60-11_sp018897875--GB_GCA_018897875.1|free_living|1    |1               |1.0     |
|s__0-14-3-00-41-53_sp002780895--GB_GCA_002780895.1|free_living|6    |6               |1.0     |
|s__1-14-0-10-32-24_sp018655525--GB_GCA_018655525.1|free_living|4    |4               |1.0     |
|s__1-14-0-10-37-14_sp002778705--GB_GCA_002778705.1|free_living|1    |1               |1.0     |
|s__1-14-0-10-45-20_sp007378985--GB_GCA_007378985.1|free_living|3    |3               |1.0     |
|s__1-14-0-10-47-16_sp002773395--GB_GCA_002773395.1|free_living|1    |1               |1.0     |
|s__1-14-0-20-42-23_sp002796345--GB_GCA_002796345.1|free_living|3    |3               |1.0     |
|s__1-14-0-20-46-22_sp00279624

In [15]:
# Assign species lifestyle: majority vote with >= 70% threshold
# Keep only the dominant lifestyle per species
w_rank = Window.partitionBy('gtdb_species_clade_id').orderBy(F.desc('count'))

species_assignment = species_lifestyle.withColumn(
    'rank', F.row_number().over(w_rank)
).filter(
    (F.col('rank') == 1) & (F.col('fraction') >= 0.7)
).select(
    'gtdb_species_clade_id',
    F.col('lifestyle').alias('species_lifestyle'),
    F.col('count').alias('n_classified_dominant'),
    'total_classified',
    F.col('fraction').alias('lifestyle_fraction')
)

print(f"Species with clear lifestyle assignment: {species_assignment.count():,}")
species_assignment.groupBy('species_lifestyle').count().show()

Species with clear lifestyle assignment: 23,900
+-----------------+-----+
|species_lifestyle|count|
+-----------------+-----+
|      free_living|12367|
|  host_associated|11533|
+-----------------+-----+



In [16]:
# Join with pangenome stats to filter by genome count and add taxonomy
target_species = species_assignment.join(
    spark.sql("""
        SELECT p.gtdb_species_clade_id, p.no_genomes, p.no_gene_clusters, p.no_core,
               p.no_aux_genome, p.no_singleton_gene_clusters,
               sc.GTDB_taxonomy, sc.GTDB_species
        FROM kbase_ke_pangenome.pangenome p
        JOIN kbase_ke_pangenome.gtdb_species_clade sc
            ON p.gtdb_species_clade_id = sc.gtdb_species_clade_id
    """),
    on='gtdb_species_clade_id',
    how='inner'
).filter(
    F.col('no_genomes') >= 10
)

# Add phylum for phylogenetic stratification
target_species = target_species.withColumn(
    'phylum', F.split(F.col('GTDB_taxonomy'), ';').getItem(1)
)

print(f"Target species (>= 10 genomes, clear lifestyle): {target_species.count():,}")
target_species.groupBy('species_lifestyle').count().show()
target_species.groupBy('species_lifestyle', 'phylum').count().orderBy(
    'species_lifestyle', F.desc('count')
).show(30, truncate=False)

Target species (>= 10 genomes, clear lifestyle): 2,529
+-----------------+-----+
|species_lifestyle|count|
+-----------------+-----+
|      free_living|  824|
|  host_associated| 1705|
+-----------------+-----+

+-----------------+---------------------+-----+
|species_lifestyle|phylum               |count|
+-----------------+---------------------+-----+
|free_living      |p__Pseudomonadota    |348  |
|free_living      |p__Actinomycetota    |98   |
|free_living      |p__Bacteroidota      |80   |
|free_living      |p__Bacillota         |50   |
|free_living      |p__Cyanobacteriota   |39   |
|free_living      |p__Patescibacteria   |31   |
|free_living      |p__Chloroflexota     |22   |
|free_living      |p__Thermoproteota    |17   |
|free_living      |p__Verrucomicrobiota |16   |
|free_living      |p__Halobacteriota    |12   |
|free_living      |p__Thermoplasmatota  |11   |
|free_living      |p__Marinisomatota    |9    |
|free_living      |p__Desulfobacterota  |9    |
|free_living      |p

In [17]:
# Save species lifestyle classification
target_pdf = target_species.toPandas()
target_pdf.to_csv('../data/species_lifestyle_classification.csv', index=False)
print(f"Saved {len(target_pdf)} species to ../data/species_lifestyle_classification.csv")
target_pdf.head(10)

Saved 2529 species to ../data/species_lifestyle_classification.csv


| | gtdb_species_clade_id | species_lifestyle | n_classified_dominant | total_classified | lifestyle_fraction | no_genomes | no_gene_clusters | no_core | no_aux_genome | no_singleton_gene_clusters | GTDB_taxonomy | GTDB_species | phylum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s__Klebsiella_pneumoniae--RS_GCF_000742135.1 | host_associated | 12776 | 13047 | 0.979229 | 14240 | 443124 | 4199 | 438925 | 276743.0 | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | s__Klebsiella_pneumoniae | p__Pseudomonadota |
| 1 | s__Staphylococcus_aureus--RS_GCF_001027105.1 | host_associated | 12785 | 12863 | 0.993936 | 14526 | 147914 | 2083 | 145831 | 86127.0 | d__Bacteria;p__Bacillota;c__Bacilli;o__Staphyl... | s__Staphylococcus_aureus | p__Bacillota |
| 2 | s__Salmonella_enterica--RS_GCF_000006945.2 | host_associated | 6956 | 7526 | 0.924263 | 11402 | 266371 | 3639 | 262732 | 149757.0 | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | s__Salmonella_enterica | p__Pseudomonadota |
| 3 | s__Streptococcus_pneumoniae--RS_GCF_001457635.1 | host_associated | 7935 | 7935 | 1.0 | 8434 | 116845 | 1475 | 115370 | 67191.0 | d__Bacteria;p__Bacillota;c__Bacilli;o__Lactoba... | s__Streptococcus_pneumoniae | p__Bacillota |
| 4 | s__Mycobacterium_tuberculosis--RS_GCF_000195955.2 | host_associated | 6638 | 6638 | 1.0 | 6903 | 143670 | 3741 | 139929 | 97549.0 | d__Bacteria;p__Actinomycetota;c__Actinomycetia... | s__Mycobacterium_tuberculosis | p__Actinomycetota |
| 5 | s__Pseudomonas_aeruginosa--RS_GCF_001457615.1 | host_associated | 4455 | 4825 | 0.923316 | 6760 | 256093 | 5199 | 250894 | 120728.0 | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | s__Pseudomonas_aeruginosa | p__Pseudomonadota |
| 6 | s__Acinetobacter_baumannii--RS_GCF_009759685.1 | host_associated | 5749 | 5776 | 0.995325 | 6647 | 211211 | 2957 | 208254 | 131031.0 | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | s__Acinetobacter_baumannii | p__Pseudomonadota |
| 7 | s__Clostridioides_difficile--RS_GCF_001077535.1 | host_associated | 1819 | 1838 | 0.989663 | 2604 | 98935 | 3078 | 95857 | 55410.0 | d__Bacteria;p__Bacillota_A;c__Clostridia;o__Pe... | s__Clostridioides_difficile | p__Bacillota_A |
| 8 | s__Enterococcus_B_faecium--RS_GCF_001544255.1 | host_associated | 2056 | 2097 | 0.980448 | 2533 | 75395 | 1971 | 73424 | 39204.0 | d__Bacteria;p__Bacillota;c__Bacilli;o__Lactoba... | s__Enterococcus_B_faecium | p__Bacillota |
| 9 | s__Enterobacter_hormaechei_A--RS_GCF_001729745.1 | host_associated | 2173 | 2235 | 0.97226 | 2453 | 157037 | 3574 | 153463 | 95270.0 | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | s__Enterobacter_hormaechei_A | p__Pseudomonadota |
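
Before downstream notebooks consume the saved file, it may be worth re-checking that every row satisfies the two filters. The sketch below does this with the standard `csv` module on a tiny in-memory stand-in. The rows are made up and `validate_rows` is a hypothetical helper; the column names follow the saved table.

```python
import csv
import io


def validate_rows(rows, min_genomes=10, min_fraction=0.7):
    """Return a list of (species_id, problem) tuples for rows that violate the filters."""
    problems = []
    for row in rows:
        species = row["gtdb_species_clade_id"]
        if row["species_lifestyle"] not in {"host_associated", "free_living"}:
            problems.append((species, "unexpected lifestyle"))
        if int(row["no_genomes"]) < min_genomes:
            problems.append((species, "too few genomes"))
        if float(row["lifestyle_fraction"]) < min_fraction:
            problems.append((species, "weak majority"))
    return problems


# Stand-in for ../data/species_lifestyle_classification.csv (made-up rows).
sample_csv = io.StringIO(
    "gtdb_species_clade_id,species_lifestyle,lifestyle_fraction,no_genomes\n"
    "s__Example_one,host_associated,0.98,120\n"
    "s__Example_two,free_living,0.65,8\n"
)
problems = validate_rows(csv.DictReader(sample_csv))
```

For the real file, the same function could be fed `csv.DictReader(open(path, newline=""))`. An empty result would mean every saved species meets both thresholds.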


## Findings

Record observations here after running:

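- How many genomes had usable lifestyle metadata? 279,547 had at least one of the five lifestyle attributes; 232,228 received a non-ambiguous label (180,083 host-associated, 52,145 free-living) and 47,319 were ambiguous.
- How many species classified as host-associated? 11,533 at the >= 70% threshold; 1,705 after the >= 10 genome filter.
- How many species classified as free-living? 12,367 at the >= 70% threshold; 824 after the >= 10 genome filter.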
- Which phyla are represented in each category? ___
- Any concerns about classification quality? ___
- Are the keyword lists adequate or do they need refinement? ___