In [2]:
import requests

url = 'https://kata.geosci.ai/challenge/sample-names'
r = requests.get(url)
print(r.text)


# Sample names

You have a set of sample names. They look like this:

    001235_Ainsa_Sobrarbe_C_2016-04-20_PCx
    ^^^^^^ ^^^^^ ^^^^^^^^ ^ ^^^^^^^^^^ ^^^
      1      2      3     4      5      6

A **valid name** consists of 6 parts separated by underscores. The parts are underlined, above. Note that the parts might not be correct or consistent. Having 6 parts, whether they are correct or not, is enough to be called 'valid'. There may be other problems, for example with the spelling or formatting of individual parts, but we will still call it 'valid'.

The 6 parts are:

- **Unique identifier** consisting of 6 characters.
- **Basin name.** Note that spellings are not guaranteed to be correct.
- **Unit or Formation name.** Note that spellings are not guaranteed to be correct.
- **Specimen type**, either H or C (hand or core).
- **Date**, which must be in ISO 8601 YYYY-MM-DD format to be considered correct.
- **Preparation codes** of at least one character.

We need to extract some inf

In [60]:
key = 'grajohnt'
qurl = 'https://kata.geosci.ai/challenge/sample-names?key='+key  # <--- In week 2, you'll change the name
r = requests.get(qurl)
samples = r.text
print(samples)

000190_Tremp_Tremp_H_2000-01-02_PT
000191_Tremp_Tremp_H_2000-01-02_C
000192_Tremp_Tremp_H_2000-01-04_x
000193_Tremp_Tremp_2000-01-05_xM
000195_Tremp_Pasarela_C_2000-01-05_Px
000197_Jaca_Morillo_H_05-01-00_T
000198_Jaca_Morillo_H_2000-01-05_PxM
000199_Jaca_Banaston_H_2000-01-05_PM
000200_Jcaa_Banaston_H_2000-01-07_C
000201_Jaca_banatson_H_2000-01-07_TC
000203_Jaca_Gerbe_C_2000-01-08_Px
000204_Ainsa_Escanilla_C_2000-01-08_TCM
000205_Ainsa_Sobrarbe_H_2000-01-09_x
000208_Ainsa_Escanilla_H_2000-01-09_Tx
000209_Ainsa_Escanilla_H_2000-01-10_PCx
000210_Ainsa_ESCANILLA_H_2000-01-11_PTC
000211_Ainsa_ESCANILLA_C_2000-01-12_PTx
000212_Ainsa_ESCANILLA_H_2000-01-13_Tx
000213_Ainsa_ESCNAILLA_H_2000-01-13_CxM
000214_Ainsa_ESCANILLA_C_2000-01-13_M
000216_Ainsa_ESCAINLLA_H_2000-01-15_M
000217_Anisa_ESCANILLA_H_2000-01-15_TC
000218_Ainsa_ESCANILLA_H_2000-01-18_C
000219_Ainsa_ESCANILLA_H_2000-01-19_C
000220_ainsa_ESCANILLA_H_2000-01-20_PC
000221_ainsa_ESCANILLA_C_2000-01-20_xM
000222_ainsa_ESCANILLA_H_200

In [61]:
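question = 1
# 1. How many valid sample names are there?

import regex  # third-party `regex` package; the standard-library `re` accepts the same pattern

# Raw string, so that \d is not treated as a string escape.
# Each of the six groups must be non-empty. Note that `.+` can also match
# underscores, so this pattern does not reject names with more than six parts.
s = regex.findall(r"(\d+)_(.+)_(.+)_(.+)_(.+)_(.+)", samples)

len(s)  # If a sample is valid, it should be in s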


9184
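
As a cross-check on the regex count, the definition of a valid name (6 parts separated by underscores) can also be tested by splitting each line. This is a minimal sketch on three of the records printed above; `is_valid` and `demo` are illustrative names, not part of the challenge, and it has not been confirmed that this stricter test gives the same count on the full data set.

```python
def is_valid(name: str) -> bool:
    """A name is 'valid' if it has exactly 6 underscore-separated parts."""
    return len(name.strip().split('_')) == 6

# Three records copied from the printed sample list
demo = """000190_Tremp_Tremp_H_2000-01-02_PT
000193_Tremp_Tremp_2000-01-05_xM
000197_Jaca_Morillo_H_05-01-00_T"""

valid = [line for line in demo.splitlines() if is_valid(line)]
print(len(valid))  # the second record has only 5 parts, so this prints 2
```

Unlike the `.+` groups in the regex, the split test also rejects names with more than six parts.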

In [18]:
# Submit

aurl = qurl+'&question='+str(question)+'&answer='+str(len(s))
print(aurl)
r = requests.get(aurl)
print(r.text)

https://kata.geosci.ai/challenge/sample-names?key=grajohnt&question=1&answer=9184
Correct
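
The submission URL is built by string concatenation in each submit cell. A possible alternative, sketched here with the standard library only, is to let `urlencode` assemble the query string. `answer_url` and `BASE_URL` are hypothetical names introduced for this example; the resulting string could be passed to `requests.get` exactly as `aurl` is above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://kata.geosci.ai/challenge/sample-names'

def answer_url(key, question, answer):
    """Build the submission URL for one question (hypothetical helper)."""
    query = urlencode({'key': key, 'question': question, 'answer': answer})
    return f'{BASE_URL}?{query}'

# Reproduces the URL printed by the submit cell for question 1
print(answer_url('grajohnt', 1, 9184))
```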


In [34]:
question = 2
#2. How many valid samples were taken in the Ainsa basin? Include records with misspelt basin names.

ac = 0
for line in s:
    # line[1] is the second part of the sample name, which contains the basin name.
    # Match an 'A' or 'a', then any three letters drawn from i, n, s (either case), then another 'A' or 'a'.
    #  This picks up misspellings such as 'Anisa' and 'ainsa'.
    if regex.search("[Aa][insINS]{3}[Aa]", line[1]):
        ac += 1

print(ac)

1949
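
The character-class regex would also accept strings such as 'Aiiia'. A stricter test, sketched below, compares the sorted letters of the basin name with those of 'Ainsa'. It assumes the misspellings are only case changes or transposed letters, as in the records printed above, and it has not been checked against the full data set. `is_scrambled` is a name made up for this example.

```python
def is_scrambled(name: str, target: str = 'Ainsa') -> bool:
    """True if `name` uses exactly the letters of `target`, ignoring case and order."""
    return sorted(name.lower()) == sorted(target.lower())

# Basin spellings taken from the printed sample list
for basin in ['Ainsa', 'Anisa', 'ainsa', 'Jaca', 'Jcaa']:
    print(basin, is_scrambled(basin))
```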


In [37]:
aurl = qurl+'&question='+str(question)+'&answer='+str(ac)
print(aurl)
r = requests.get(aurl)
print(r.text)


https://kata.geosci.ai/challenge/sample-names?key=grajohnt&question=2&answer=1949
Correct


In [72]:
question = 3
#3. What's the longest period of days with no valid samples taken in Ainsa?

# Use datetime as suggested...
from datetime import datetime

# Reset/define the variables
lastdate = None
maxtd = 0

for line in s:
    # Use the regex match from question 2
    if regex.findall("[Aa][insINS]{3}[Aa]", line[1]):
        # Read the date from the fifth part, skipping dates that are not valid ISO 8601
        try:
            sdate = datetime.fromisoformat(line[4])
        except ValueError:
            continue

        # Find the gap since the previous sample, and see if it's the longest
        #  (this assumes the samples are listed in date order)
        if lastdate is not None:
            td = sdate - lastdate
            if td.days > maxtd:
                maxtd = td.days
        lastdate = sdate

# Note that maxtd is the difference between two consecutive sample dates, so it counts the later sample day too.
#  We must therefore subtract one from this value to get the number of sample-free days *between* samples.

print(str(maxtd))
print('Number of days between samples = ' + str(maxtd-1))

294
Number of days between samples = 293
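
The loop above relies on the records being in date order. A variant that does not, sketched here on a handful of made-up date strings rather than the challenge data, sorts the parsed dates first and then compares neighbours. `longest_gap` is an illustrative name, and sorting is an addition that the original solution does not make.

```python
from datetime import date

def longest_gap(iso_dates):
    """Longest run of sample-free days between consecutive valid dates."""
    parsed = []
    for text in iso_dates:
        try:
            parsed.append(date.fromisoformat(text))
        except ValueError:
            continue  # skip dates that are not valid ISO 8601, e.g. '05-01-00'
    parsed.sort()
    # Days strictly between neighbours: difference minus one, never below zero
    gaps = [max((b - a).days - 1, 0) for a, b in zip(parsed, parsed[1:])]
    return max(gaps, default=0)

# Illustrative input only: valid dates are 9, 10 and 13 January 2000
print(longest_gap(['2000-01-09', '05-01-00', '2000-01-13', '2000-01-10']))  # prints 2
```

Here the 11th and 12th are the only days without a sample, which is why the result is one less than the three-day difference between the 10th and the 13th.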


In [68]:
aurl = qurl+'&question='+str(question)+'&answer='+str(maxtd-1)
print(aurl)
r = requests.get(aurl)
print(r.text)

https://kata.geosci.ai/challenge/sample-names?key=grajohnt&question=3&answer=293
Correct! The next challenge is: https://kata.geosci.ai/challenge/prospecting - good luck!
