
## "Unlocking Kenya's Land Treasure: The Structured Data Extraction Challenge"

* Welcome to our organization's first internal competition, where we set out to unlock the information held in Kenya's gazette land registration archives. In this challenge, you will work with the unstructured text of Kenyan gazettes and extract three fields from each notice: holder names, registration numbers, and land locations.

* The objective is to turn this unstructured text into structured records that can support analytics, informed decision-making, and new solutions in various sectors. With structured data, we aim to improve land management practices, drive economic growth, and foster sustainable development across Kenya.



In [None]:
from google.colab import drive
drive.mount('/content/gdrive/')

Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).


### Train Data
* Within the provided internal competition folder, there are "train" and "test" directories.
* The "train" directory encompasses all the necessary data for training your AI models.
* The CSV files within represent the extracted information from the PDF files with corresponding filenames.
* The "Train.csv" file amalgamates all the individual train PDF files' CSVs.

In [None]:
import pandas as pd


In [None]:
train = pd.read_csv("/content/gdrive/MyDrive/internal_competition/train/Train.csv")

In [None]:
display(" Peek into the data",
        train.head(),
        " train's shape",
        train.shape)

' Peek into the data'

Unnamed: 0,filename,page number,gazzete notice number,name of the holder,Registration numbers,Land location
0,2008_VOL4,14,43,Syprose Helida Odero,Kisumu/Ojolla/418,Kisumu
1,2008_VOL4,14,44,David Ndungu Muchiiri,Gilgil/Gilgil Block 1/500,district of Nakuru
2,2008_VOL4,14,45,Njuguna Ngamau,Shawa/Rongai Block 3/56(Sachangwan),district of Nakuru
3,2008_VOL4,14,46,Mary Njeri Nganga,Dundori/Lanet Block 13/1,district of Nakuru
4,2008_VOL4,14,47,Charles Muteru Wambugu,Gilgil/Cilgil Block 1/375,district of Nakuru


" train's shape"

(649, 6)

* Filename: Name of the PDF from which the information has been obtained.
* Page number: The page on which the gazette notice appears. Use it to check the notice in the PDF and correct any misrepresented information.
* Gazzete notice number: The number of the gazette notice. Together with the filename, it identifies a row.
* Name of the holder: Holder of the land mentioned. This is one of the attributes we will be extracting.
* Registration numbers: Any LR No, IR No, CR No, or Title No mentioned in the gazette notice. If two or more are present, they are recorded as comma-separated values. This is also to be extracted.
* Land location: The location where the land is situated. Also to be extracted.


In [None]:
train[(train['filename'].isin(['2023_VOL168'])) & (train['gazzete notice number'].isin([9638]))]


Unnamed: 0,filename,page number,gazzete notice number,name of the holder,Registration numbers,Land location
560,2023_VOL168,3156,9638,Muturi Wanaina Muturi,Sosian/Sosian Block 2/2117 (Narock Ranch),


### Test Data
* The "test" directory encompasses all the necessary data for performing inference using your trained AI models.
* The CSV files within represent the extracted information from the PDF files with corresponding filenames, but with the columns "Name of the Holder," "Registration Numbers," and "Land Location" left empty, as these are the fields we are aiming to extract.
* The "Test.csv" file amalgamates all the individual test PDF files' CSVs.

In [None]:
test = pd.read_csv("/content/gdrive/MyDrive/internal_competition/test/Test.csv")
display(" Peek into the data",
        test.head(),
        " test's shape",
        test.shape)

' Peek into the data'

Unnamed: 0,filename,gazzete notice number,name of the holder,Registration numbers,Land location
0,2023_VOL30,1430,,,
1,2023_VOL30,1431,,,
2,2023_VOL30,1432,,,
3,2023_VOL30,1433,,,
4,2023_VOL30,1434,,,


" test's shape"

(374, 5)

### Sample Submission
* The submission file is a melted version of Test.csv.
* What does this mean? Let us first take a look at the provided sample submission.

In [None]:
sample_sub = pd.read_csv("/content/gdrive/MyDrive/internal_competition/sample_submission.csv")
display(sample_sub.head(), sample_sub.shape)

Unnamed: 0,id,pred
0,2023_VOL30_1430_name of the holder,
1,2023_VOL30_1431_name of the holder,
2,2023_VOL30_1432_name of the holder,
3,2023_VOL30_1433_name of the holder,
4,2023_VOL30_1434_name of the holder,


(1122, 2)

Melting is a data reshaping operation that converts a wide format, where each observation is a single row with several value columns, into a long format, where each observation spans several rows, one per variable. The melted columns are stacked into a single value column, and a companion column records which variable each value came from, so the relationships between them are preserved. This makes evaluation easier, as we will see later on.

* The shape has now changed from 374 rows to 1122 rows (374 × 3 fields to predict), with the "id" column taking the form filename_gazzetenoticenumber_variabletobepredicted, for example "2023_VOL30_1430_name of the holder".
* Let's explore the transformation from wide format to long format using our training dataframe.
* This conversion is crucial for ensuring a smooth evaluation phase.
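
As a minimal sketch of the idea, the toy frame below reuses the first two training rows shown earlier, with an `id` column already built. Two observations with three value columns melt into 2 × 3 = 6 rows, the same arithmetic that turns 374 test rows into 1122 submission rows.

```python
import pandas as pd

# Two rows in wide format: one row per notice, one column per field.
wide = pd.DataFrame({
    "id": ["2008_VOL4_43", "2008_VOL4_44"],
    "name of the holder": ["Syprose Helida Odero", "David Ndungu Muchiiri"],
    "Registration numbers": ["Kisumu/Ojolla/418", "Gilgil/Gilgil Block 1/500"],
    "Land location": ["Kisumu", "district of Nakuru"],
})

# Long format: one row per (id, variable) pair.
long = pd.melt(wide, id_vars=["id"], var_name="Variable", value_name="pred")

print(wide.shape)  # (2, 4)
print(long.shape)  # (6, 3)
```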

### Melting manipulation on our train dataframe:
* This is the same process you will follow to transform your test df from wide to long format

In [None]:
train.head()

Unnamed: 0,filename,page number,gazzete notice number,name of the holder,Registration numbers,Land location
0,2008_VOL4,14,43,Syprose Helida Odero,Kisumu/Ojolla/418,Kisumu
1,2008_VOL4,14,44,David Ndungu Muchiiri,Gilgil/Gilgil Block 1/500,district of Nakuru
2,2008_VOL4,14,45,Njuguna Ngamau,Shawa/Rongai Block 3/56(Sachangwan),district of Nakuru
3,2008_VOL4,14,46,Mary Njeri Nganga,Dundori/Lanet Block 13/1,district of Nakuru
4,2008_VOL4,14,47,Charles Muteru Wambugu,Gilgil/Cilgil Block 1/375,district of Nakuru


In [None]:
# Let's create a unique id by concatenating the filename and the gazette notice number
train['id'] = train['filename'] + '_' + train['gazzete notice number'].astype('str')
# Drop the filename and gazzete notice number, since the id column now represents them,
# and the page number, since it is only there for checking details against the gazette PDF
train.drop(columns = ['filename', 'gazzete notice number', 'page number'], inplace = True)
train.head()

Unnamed: 0,name of the holder,Registration numbers,Land location,id
0,Syprose Helida Odero,Kisumu/Ojolla/418,Kisumu,2008_VOL4_43
1,David Ndungu Muchiiri,Gilgil/Gilgil Block 1/500,district of Nakuru,2008_VOL4_44
2,Njuguna Ngamau,Shawa/Rongai Block 3/56(Sachangwan),district of Nakuru,2008_VOL4_45
3,Mary Njeri Nganga,Dundori/Lanet Block 13/1,district of Nakuru,2008_VOL4_46
4,Charles Muteru Wambugu,Gilgil/Cilgil Block 1/375,district of Nakuru,2008_VOL4_47


In [None]:
# Now we melt the train df
melted_train = pd.melt(train, id_vars=['id'], var_name = 'Variable', value_name= 'pred')
melted_train.head()

Unnamed: 0,id,Variable,pred
0,2008_VOL4_43,name of the holder,Syprose Helida Odero
1,2008_VOL4_44,name of the holder,David Ndungu Muchiiri
2,2008_VOL4_45,name of the holder,Njuguna Ngamau
3,2008_VOL4_46,name of the holder,Mary Njeri Nganga
4,2008_VOL4_47,name of the holder,Charles Muteru Wambugu


In [None]:
# The unique categories in the Variable column are the three fields we need to extract
melted_train.Variable.unique()

array(['name of the holder', 'Registration numbers', 'Land location'],
      dtype=object)

In [None]:
# For a fully unique id, concatenate the id with the Variable column, then drop Variable since the id now represents it
melted_train['id'] = melted_train['id'] + '_' + melted_train['Variable']
melted_train.drop(columns =['Variable'], inplace = True)
melted_train.head()


Unnamed: 0,id,pred
0,2008_VOL4_43_name of the holder,Syprose Helida Odero
1,2008_VOL4_44_name of the holder,David Ndungu Muchiiri
2,2008_VOL4_45_name of the holder,Njuguna Ngamau
3,2008_VOL4_46_name of the holder,Mary Njeri Nganga
4,2008_VOL4_47_name of the holder,Charles Muteru Wambugu


In [None]:
# We now have a melted train that resembles the sample submission. Let's compare the shapes of train (after the drop) and the melted frame
train.shape, melted_train.shape

((649, 4), (1947, 2))

In [None]:
# Now do the same for your test df after you have performed inference. You will get all
# the unique ids in the sample submission, with the pred column holding your predictions.
# The pred column below is empty for now; it is filled in after inference.
sample_sub.head()

Unnamed: 0,id,pred
0,2023_VOL30_1430_name of the holder,
1,2023_VOL30_1431_name of the holder,
2,2023_VOL30_1432_name of the holder,
3,2023_VOL30_1433_name of the holder,
4,2023_VOL30_1434_name of the holder,
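
The steps walked through for the training frame can be bundled into one helper and applied to the test frame once its three target columns are filled. The sketch below is one possible way to do that, not an official utility: `to_submission` and `TARGETS` are hypothetical names, and the two-row frame stands in for a Test.csv after inference, with invented predictions.

```python
import pandas as pd

# The three fields to predict, as named in the train and test CSVs.
TARGETS = ["name of the holder", "Registration numbers", "Land location"]

def to_submission(df):
    """Melt a wide frame with filled target columns into (id, pred) rows."""
    out = df.copy()  # leave the caller's frame untouched
    out["id"] = out["filename"] + "_" + out["gazzete notice number"].astype(str)
    long = pd.melt(out, id_vars=["id"], value_vars=TARGETS,
                   var_name="Variable", value_name="pred")
    long["id"] = long["id"] + "_" + long["Variable"]
    return long[["id", "pred"]]

# Stand-in for Test.csv after inference; the predictions are made up.
filled_test = pd.DataFrame({
    "filename": ["2023_VOL30", "2023_VOL30"],
    "gazzete notice number": [1430, 1431],
    "name of the holder": ["Holder One", "Holder Two"],
    "Registration numbers": ["Block 1/1", "Block 1/2"],
    "Land location": ["Somewhere", "Elsewhere"],
})

submission = to_submission(filled_test)
print(submission["id"].iloc[0])  # 2023_VOL30_1430_name of the holder
print(submission.shape)          # (6, 2)
```

Passing `value_vars` explicitly keeps any extra columns in the frame out of the melted result.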


## Evaluation process
* The evaluation metric used is accuracy.
* Let me demonstrate it below using a custom accuracy score calculation method


* The function below takes a string (a name, registration number, or location) and applies several preprocessing steps so that comparisons are consistent.

  * a. Lowercase Conversion: It converts the entire string to lowercase to remove any case sensitivity. This ensures that names like "Mary" and "mary" are treated as the same.

  * b. Whitespace Removal: It removes all whitespace characters, including the single spaces between words. This ensures that "Syprose Helida Odero" and "SyproseHelidaOdero" are treated as the same.

  * c. Special Character Removal: It removes every character that is not a letter, a digit, or a forward slash, so punctuation such as parentheses does not affect the comparison. Commas are kept only as separators between comma-separated values.

  * Example Usage: The function is applied to a list of strings, and the preprocessed results are stored in another list. These can then be used for comparison or evaluation purposes.

In [None]:
import re

def preprocess_name(name):
    try:
        name = str(name)
        # Convert to lowercase and remove all whitespace
        name = re.sub(r'\s+', '', name.lower().strip())

        # Comma-separated values: clean each part, then rejoin with commas
        if ',' in name:
            names = name.split(',')
            names = [re.sub(r'[^a-zA-Z0-9\s/]', '', n) for n in names]
            return ','.join(names)

        # Remove special characters (everything except letters, digits and "/")
        name = re.sub(r'[^a-zA-Z0-9\s/]', '', name)
        return name
    except Exception:
        # If anything goes wrong, return the value as it stood at that point
        return name

# Example usage: each input is printed in its preprocessed form
names = ["Syprose Helida Odero", "syprose helida odero", "SyproseHelidaOdero", "Syprose Helida Odero", "Mary Wanjiku Ndungu,Duncan Wachira Wanjau", "Sosian/Sosian Block 2/2117 (Narock Ranch)", "Evurore/Kathera/2061"]
preprocessed_names = [preprocess_name(name) for name in names]
print(preprocessed_names)
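
To make the individual steps concrete, the short sketch below traces one of the inputs by hand, using the same regular expressions as `preprocess_name`. The intermediate variable names are only for illustration.

```python
import re

raw = "Sosian/Sosian Block 2/2117 (Narock Ranch)"

# Step a: lowercase (and trim the ends).
lowered = raw.lower().strip()
# Step b: remove every whitespace character.
no_space = re.sub(r"\s+", "", lowered)
# Step c: keep only letters, digits and "/".
cleaned = re.sub(r"[^a-zA-Z0-9\s/]", "", no_space)

print(lowered)   # sosian/sosian block 2/2117 (narock ranch)
print(no_space)  # sosian/sosianblock2/2117(narockranch)
print(cleaned)   # sosian/sosianblock2/2117narockranch
```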


In [None]:
# Now we calculate the accuracy score
from sklearn.metrics import accuracy_score

# Example arrays of predictions and actual values
predictions = ["Syprose Helida Odero", "syprose helida odero", "SyproseHelidaOdero", "Syprose Helida Odero", "Mary Wanjiku Ndungu,Duncan Wachira Wanjau"]
actual_values = ["Syprose Helida Odero", "SyproseHelidaOdero", "syprose helida odero", "Michael Kimani Macharia", "Mary Wanjiku Ndungu,Duncan Wachira Wanjau"]

# Preprocess predictions and actual values
preprocessed_predictions = [preprocess_name(name) for name in predictions]
preprocessed_actual_values = [preprocess_name(name) for name in actual_values]

# Calculate accuracy score using preprocessed arrays
accuracy = accuracy_score(preprocessed_actual_values, preprocessed_predictions)

print("Accuracy Score:", accuracy)

Accuracy Score: 0.8


* We get an accuracy score of 0.8 even though some of the pairs differ in case and spacing.
* It is not 1 because of the fourth pair, "Syprose Helida Odero" versus "Michael Kimani Macharia".
* Where a field has comma-separated values, make sure you record every one of them. For example, a notice may name several holders, and the registration numbers may mention both an LR No and an IR No.
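
The sketch below illustrates what this means for comma-separated values, assuming the leaderboard compares preprocessed strings for exact equality, as the accuracy demonstration above suggests. `normalize` is a condensed restatement of `preprocess_name` with the same regular expressions, repeated here only so the block runs on its own.

```python
import re

def normalize(value):
    # Condensed restatement of preprocess_name (same regexes), for a standalone example.
    text = re.sub(r"\s+", "", str(value).lower().strip())
    return ",".join(re.sub(r"[^a-zA-Z0-9\s/]", "", part) for part in text.split(","))

actual = normalize("Mary Wanjiku Ndungu,Duncan Wachira Wanjau")

extra_space = normalize("Mary Wanjiku Ndungu, Duncan Wachira Wanjau")
swapped_order = normalize("Duncan Wachira Wanjau,Mary Wanjiku Ndungu")
one_missing = normalize("Mary Wanjiku Ndungu")

print(extra_space == actual)    # True: spacing around the comma is ignored
print(swapped_order == actual)  # False: order of the values matters
print(one_missing == actual)    # False: a missing value counts as a miss
```

Under exact matching, a partially correct list scores the same as a wrong one, so both the completeness and the order of the values matter.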

#### Key takeaways from this evaluation phase
* Use the custom preprocessing function provided above on your predictions before downloading your submission file.
* The actual labels in the competition have been processed with the same function, so make sure your outputs are preprocessed the same way.
* Here is an example using the melted train dataframe.

In [None]:
melted_train.head()

Unnamed: 0,id,pred
0,2008_VOL4_43_name of the holder,Syprose Helida Odero
1,2008_VOL4_44_name of the holder,David Ndungu Muchiiri
2,2008_VOL4_45_name of the holder,Njuguna Ngamau
3,2008_VOL4_46_name of the holder,Mary Njeri Nganga
4,2008_VOL4_47_name of the holder,Charles Muteru Wambugu


In [None]:
# Rename the pred column above to actual, then copy it into a new pred column
melted_train.rename(columns = {'pred': 'actual'}, inplace = True)
melted_train['pred'] = melted_train['actual'].copy()
melted_train.head()

Unnamed: 0,id,actual,pred
0,2008_VOL4_43_name of the holder,Syprose Helida Odero,Syprose Helida Odero
1,2008_VOL4_44_name of the holder,David Ndungu Muchiiri,David Ndungu Muchiiri
2,2008_VOL4_45_name of the holder,Njuguna Ngamau,Njuguna Ngamau
3,2008_VOL4_46_name of the holder,Mary Njeri Nganga,Mary Njeri Nganga
4,2008_VOL4_47_name of the holder,Charles Muteru Wambugu,Charles Muteru Wambugu


In [None]:
# The actual and pred columns have not been preprocessed yet, whereas the leaderboard labels have been processed with the function explained above
melted_train['actual'] = melted_train['actual'].apply(preprocess_name)
melted_train['pred'] = melted_train['pred'].apply(preprocess_name)

melted_train.head()


Unnamed: 0,id,actual,pred
0,2008_VOL4_43_name of the holder,syprosehelidaodero,syprosehelidaodero
1,2008_VOL4_44_name of the holder,davidndungumuchiiri,davidndungumuchiiri
2,2008_VOL4_45_name of the holder,njugunangamau,njugunangamau
3,2008_VOL4_46_name of the holder,marynjeringanga,marynjeringanga
4,2008_VOL4_47_name of the holder,charlesmuteruwambugu,charlesmuteruwambugu


Notice how uniformly the values are now formatted.
Calculating the accuracy is now straightforward.

In [None]:
accuracy_score(melted_train['actual'], melted_train['pred'])

1.0

* We get an accuracy score of 100%, but only because we copied the predictions from the actual values. If your AI model is good enough, you can achieve that too.
* In the competition you will not have the actual values; your submission file will contain only the pred column.
* Your submission file should have exactly the same format (columns) as the sample submission.
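
Before uploading, a quick sanity check against the sample submission can catch format mistakes. The sketch below is one possible check, not part of the competition tooling: `check_submission` is a hypothetical helper, and the small frames stand in for the real sample submission and a candidate file.

```python
import pandas as pd

def check_submission(sub, sample):
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    if list(sub.columns) != list(sample.columns):
        problems.append("columns differ from the sample submission")
    elif set(sub["id"]) != set(sample["id"]):
        problems.append("ids differ from the sample submission")
    if len(sub) != len(sample):
        problems.append("row count differs from the sample submission")
    return problems

# Stand-in for sample_submission.csv (ids follow the format shown above).
sample = pd.DataFrame({
    "id": ["2023_VOL30_1430_name of the holder", "2023_VOL30_1431_name of the holder"],
    "pred": [None, None],
})

# A candidate with the right ids and columns, and one that lost a row.
good = sample.assign(pred=["holderone", "holdertwo"])
bad = good.iloc[:1]

print(check_submission(good, sample))  # []
print(check_submission(bad, sample))
```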



In [None]:
sample_sub.head()

Unnamed: 0,id,pred
0,2023_VOL30_1430_name of the holder,
1,2023_VOL30_1431_name of the holder,
2,2023_VOL30_1432_name of the holder,
3,2023_VOL30_1433_name of the holder,
4,2023_VOL30_1434_name of the holder,


* After performing inference, apply the preprocessing function to the pred column, then download the file.


In [None]:
from google.colab import files
# At this point sample_sub['pred'] should hold your predictions
sample_sub['pred'] = sample_sub['pred'].apply(preprocess_name)
sample_sub.to_csv("baseline.csv", index= False)
files.download('/content/baseline.csv')


* Then take the downloaded CSV file and submit it in the competitions tab on Hugging Face.
