<a href="https://colab.research.google.com/github/pnadelofficial/hf_classification_hackthon2_12/blob/main/Classification_with_HuggingGace_transformers_Hackathon2_12.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification with HuggingFace `transformers`

In this notebook, our task is to find and correct misspelled or misidentified place names in geographic data. Alongside the place names, we have scores from a fuzzy-matching process, which give us a starting point for building our model.
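The notebook does not show how the fuzzy-matching scores were computed. Purely as an illustration, a token-sort style similarity on a 0-100 scale can be sketched with Python's standard `difflib`. The helper name `token_sort_score` is hypothetical, and the real `score_sort` values may come from a different library and formula.

```python
from difflib import SequenceMatcher

def token_sort_score(a: str, b: str) -> int:
    """Hypothetical 0-100 similarity: sort whitespace-separated tokens, then compare."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, norm_a, norm_b).ratio())

# one pair taken from the data shown in this notebook
print(token_sort_score("afghanistan - laghman - mihtariam",
                       "afghanistan - laghman - mihtarlam"))
```

A one-letter misspelling keeps the score high, while two unrelated names score low, which is the signal the labels below are built from.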


A few notes on running this yourself:
1. It requires a GPU. You can access time on a GPU for free in Colab by going to the `Runtime` tab and selecting `Change runtime type`. Then find `GPU` in the `Hardware accelerator` drop down and hit `Save`. 
2. I tried to include `pip install`s for all of the software packages we are using, but if you see an error, it may be a version or installation issue.

In [None]:
!pip install transformers sentencepiece datasets -U

In [None]:
import pandas as pd
import numpy as np
from datasets import Dataset, DatasetDict
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer

## Data prep

In [None]:
# your CSV from the Box folder (much better than the double csv)
match_adm = pd.read_csv('/content/MATCH Fuzzy Matching.csv')
match_adm

Unnamed: 0.1,Unnamed: 0,target_adm,SURVEYID,MATCHED_ADM,score_sort,score_partial,ADM_012,GID_2
0,0,afghanistan - daykundi - ishtarlay,afgh38.csv,afghanistan - daykundi - shahristan,85,54,afghanistan - daykundi - shahristan,AFG.6.4_1
1,0,afghanistan - laghman - mihtariam,afgh39.csv,afghanistan - laghman - mihtarlam,97,205,afghanistan - laghman - mihtarlam,AFG.20.4_1
2,0,afghanistan - laghman - qarghayi,afgh39.csv,afghanistan - laghman - qarghayi,100,206,afghanistan - laghman - qarghayi,AFG.20.5_1
3,0,afghanistan - nangarhar - behsud,afgh40.csv,afghanistan - nangarhar - hisarak,84,219,afghanistan - nangarhar - hisarak,AFG.22.8_1
4,0,afghanistan - nangarhar - kama rodat,afgh40.csv,afghanistan - nangarhar - rodat,92,228,afghanistan - nangarhar - rodat,AFG.22.17_1
...,...,...,...,...,...,...,...,...
3418,0,zimbabwe - mashonaland central - rushinga,zimb01.csv,zimbabwe - mashonaland central - rushinga,100,355093,zimbabwe - mashonaland central - rushinga,ZWE.4.9_2
3419,25,zimbabwe - mashonaland central - rushinga,zimb01.csv,zimbabwe - mashonaland central - rushinga,100,355094,zimbabwe - mashonaland central - rushinga,ZWE.4.9_2
3420,50,zimbabwe - mashonaland central - rushinga,zimb01.csv,zimbabwe - mashonaland central - rushinga,100,355095,zimbabwe - mashonaland central - rushinga,ZWE.4.9_2
3421,75,zimbabwe - mashonaland central - rushinga,zimb01.csv,zimbabwe - mashonaland central - rushinga,100,355096,zimbabwe - mashonaland central - rushinga,ZWE.4.9_2


In [None]:
match_adm['input'] = match_adm.apply(lambda x: f"TEXT1: {x['target_adm']}; TEXT2: {x['ADM_012']}",axis=1) # can change this
match_adm['labels'] = match_adm['score_sort'].apply(lambda x: 1 if x>85 else 0) # can change this threshold (related to the fuzzy matching)
match_adm['labels'] = match_adm['labels'].astype(float)
data = match_adm[['input','labels']].sample(len(match_adm)//3) # can change this (how much of the original data we are using to train)
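To make the row-level logic concrete, here is a small pandas-free sketch of what the two `apply` calls do to a single row. The function name `make_example` is made up for illustration; the prompt format and the `> 85` threshold are copied from the cell above.

```python
def make_example(target_adm: str, adm_012: str, score_sort: int, threshold: int = 85) -> dict:
    """Build the model input string and a float label from one row."""
    return {
        "input": f"TEXT1: {target_adm}; TEXT2: {adm_012}",
        "labels": 1.0 if score_sort > threshold else 0.0,
    }

# two rows from the table above
print(make_example("afghanistan - daykundi - ishtarlay", "afghanistan - daykundi - shahristan", 85))
print(make_example("afghanistan - laghman - mihtariam", "afghanistan - laghman - mihtarlam", 97))
```

Note that the comparison is strict, so a score of exactly 85 is labeled 0.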

In [None]:
data

Unnamed: 0,input,labels
2052,TEXT1: kenya - makueni - malili; TEXT2: kenya ...,0.0
2343,TEXT1: nepal - eastern - kosi; TEXT2: nepal - ...,1.0
3216,TEXT1: togo - maritime - golfe; TEXT2: togo - ...,1.0
1246,TEXT1: ethiopia - somali - liben; TEXT2: ethio...,1.0
545,TEXT1: cote divoire - woroba - bafing; TEXT2: ...,1.0
...,...,...
2749,TEXT1: somalia - lower shebelle - merca; TEXT2...,0.0
2038,TEXT1: kenya - makueni - nguu; TEXT2: kenya - ...,0.0
2460,TEXT1: pakistan - sindh - thatta; TEXT2: pakis...,0.0
3420,TEXT1: zimbabwe - mashonaland central - rushin...,1.0


In [None]:
ds = Dataset.from_pandas(data)
ds

Dataset({
    features: ['input', 'labels', '__index_level_0__'],
    num_rows: 1141
})

## Metrics and model training

In [None]:
model_nm = 'microsoft/deberta-v3-small' # simple model, easy to download and run
tokz = AutoTokenizer.from_pretrained(model_nm) # need tokenizer

def tok_func(x): 
  return tokz(x["input"])

tok_ds = ds.map(tok_func, batched=True) # tokenization

In [None]:
# hf datasets splits our data into a train set and a held-out test set (used here for validation)
tts = tok_ds.train_test_split(0.25, seed=42)
tts

DatasetDict({
    train: Dataset({
        features: ['input', 'labels', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 855
    })
    test: Dataset({
        features: ['input', 'labels', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 286
    })
})
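The row counts above can be checked by hand. This sketch assumes the test share is rounded up and the remainder goes to training, which matches the 855/286 split shown; `split_sizes` is a hypothetical helper, not part of `datasets`.

```python
import math

def split_sizes(n_rows: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test), rounding the test share up (assumed behavior)."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

print(split_sizes(1141, 0.25))
```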

In [None]:
def corr_d(eval_pred): 
    return {'pearson': np.corrcoef(*eval_pred)[0][1]} # using r value for metrics
# can read more on r here: https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php
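For intuition about the metric, here is Pearson's r written out in plain Python. It is a sketch of the textbook formula, not the `np.corrcoef` implementation, and the toy predictions and labels are invented.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

toy_preds = [0.1, 0.9, 0.8, 0.2]   # hypothetical model outputs
toy_labels = [0.0, 1.0, 1.0, 0.0]  # hypothetical true labels
print(pearson_r(toy_preds, toy_labels))
```

Values near 1 mean the predictions rise and fall with the labels; values near 0 mean there is no linear relationship.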

In [None]:
# feel free to play around here to find out what works best

# can change these to tune results
bs = 128
epochs = 10
lr = 6e-5 

args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')

PyTorch: setting up devices
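To get a feel for `warmup_ratio=0.1` together with `lr_scheduler_type='cosine'`, the sketch below computes a learning rate that ramps up linearly and then decays along a half cosine. The step counts are assumptions worked out from this notebook's numbers (855 training rows, batch size 128, 10 epochs), and `lr_at` is a hypothetical helper rather than the `Trainer`'s own scheduler.

```python
import math

steps_per_epoch = math.ceil(855 / 128)   # 7 batches per epoch
total_steps = steps_per_epoch * 10       # 70 optimizer steps over 10 epochs
warmup_steps = 7                         # 10% of 70, per warmup_ratio=0.1

def lr_at(step: int, base_lr: float = 6e-5) -> float:
    """Linear warmup to base_lr, then half-cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 7, 35, 70):
    print(step, lr_at(step))
```

The peak of `6e-5` is reached at the end of warmup; changing `lr`, `epochs`, or `bs` shifts where that peak falls.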


In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(model, args, train_dataset=tts['train'], eval_dataset=tts['test'],
                  tokenizer=tokz, compute_metrics=corr_d)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--microsoft--deberta-v3-small/snapshots/a36c739020e01763fe789b4b85e2df55d6180012/config.json
Model config DebertaV2Config {
  "_name_or_path": "microsoft/deberta-v3-small",
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_eps": 1e-07,
  "max_position_embeddings": 512,
  "max_relative_positions": -1,
  "model_type": "deberta-v2",
  "norm_rel_ebd": "layer_norm",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 0,
  "pooler_dropout": 0,
  "pooler_hidden_act": "gelu",
  "pooler_hidden_size": 768,
  "pos_att_type": [
    "p2c",
    "c2p"
  ],
  "position_biased_input": false,
  "position_buckets": 256,
  "relative_attention": true,
  "share_att_key": true,
  "transform

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: __index_level_0__, input. If __index_level_0__, input are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 855
  Num Epochs = 10
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 70
  Number of trainable parameters = 141895681


Epoch,Training Loss,Validation Loss,Pearson
1,No log,0.347131,0.475561
2,No log,0.217163,0.768025
3,No log,0.084177,0.875039
4,No log,0.07278,0.888098
5,No log,0.063761,0.915702
6,No log,0.0462,0.917265
7,No log,0.037412,0.933936
8,No log,0.038537,0.934757
9,No log,0.039779,0.9312
10,No log,0.038833,0.932376
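Because the model is created with `num_labels=1` and float labels, `transformers` treats this as a regression problem, so the loss being minimized should be a mean squared error. The sketch below shows that quantity on invented numbers; `mse` is a hypothetical helper, not the library's loss function.

```python
def mse(preds, labels):
    """Mean squared error between predictions and labels."""
    return sum((p - l) ** 2 for p, l in zip(preds, labels)) / len(preds)

toy_preds = [0.1, 0.9, 0.8, 0.2]   # hypothetical model outputs
toy_labels = [0.0, 1.0, 1.0, 0.0]  # hypothetical true labels
print(mse(toy_preds, toy_labels))
```

Read this way, a validation loss around 0.04 corresponds to a typical error of roughly 0.2 on the 0-to-1 label scale.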


The following columns in the evaluation set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: __index_level_0__, input. If __index_level_0__, input are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 286
  Batch size = 256

TrainOutput(global_step=70, training_loss=0.11920833587646484, metrics={'train_runtime': 20.8462, 'train_samples_per_second': 410.148, 'train_steps_per_second': 3.358, 'total_flos': 94499576467560.0, 'train_loss': 0.11920833587646484, 'epoch': 10.0})

In [None]:
trainer.save_model("SMART_spelling_model_deberta-v3-small")

Saving model checkpoint to SMART_spelling_model_deberta-v3-small
Configuration saved in SMART_spelling_model_deberta-v3-small/config.json
Model weights saved in SMART_spelling_model_deberta-v3-small/pytorch_model.bin
tokenizer config file saved in SMART_spelling_model_deberta-v3-small/tokenizer_config.json
Special tokens file saved in SMART_spelling_model_deberta-v3-small/special_tokens_map.json


In [None]:
!zip -r SMART_spelling_model_deberta-v3-small.zip SMART_spelling_model_deberta-v3-small

updating: SMART_spelling_model_deberta-v3-small/ (stored 0%)
updating: SMART_spelling_model_deberta-v3-small/tokenizer_config.json (deflated 45%)
updating: SMART_spelling_model_deberta-v3-small/pytorch_model.bin (deflated 28%)
updating: SMART_spelling_model_deberta-v3-small/config.json (deflated 53%)
updating: SMART_spelling_model_deberta-v3-small/added_tokens.json (stored 0%)
updating: SMART_spelling_model_deberta-v3-small/training_args.bin (deflated 48%)
updating: SMART_spelling_model_deberta-v3-small/tokenizer.json (deflated 77%)
updating: SMART_spelling_model_deberta-v3-small/special_tokens_map.json (deflated 54%)
updating: SMART_spelling_model_deberta-v3-small/spm.model (deflated 50%)


## Inference

In [None]:
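# "preds" for our evaluation set
# note: index [1] of the PredictionOutput is `label_ids` (the true labels), which is why only 0s and 1s appear below;
# the model's raw outputs are in `.predictions` (index [0])
preds = trainer.predict(trainer.eval_dataset)[1]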
preds

The following columns in the test set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: __index_level_0__, input. If __index_level_0__, input are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 286
  Batch size = 256


array([0., 1., 0., 0., 1., 1., 1., 1., 0., 1., 0., 1., 1., 0., 1., 0., 0.,
       0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 1., 1., 0.,
       0., 1., 0., 0., 0., 1., 1., 1., 0., 1., 0., 0., 0., 1., 0., 0., 1.,
       0., 1., 0., 0., 1., 0., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 1.,
       0., 0., 0., 0., 1., 0., 1., 1., 0., 0., 1., 0., 1., 0., 1., 1., 1.,
       1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 1., 1., 1., 1., 0.,
       0., 0., 0., 1., 1., 1., 1., 1., 0., 0., 0., 1., 0., 1., 1., 1., 1.,
       0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 1., 1., 0., 0.,
       1., 0., 1., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 0., 1.,
       0., 1., 1., 1., 0., 1., 0., 0., 1., 0., 0., 1., 0., 0., 1., 1., 1.,
       1., 1., 1., 1., 0., 1., 0., 1., 1., 0., 1., 0., 0., 0., 1., 0., 1.,
       1., 1., 0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 0., 1., 0., 1., 0.,
       0., 1., 0., 1., 0., 0., 0., 0., 1., 0., 0., 1., 1., 1., 1., 0., 0.,
       1., 1., 0., 0., 0.

In [None]:
preds_df = pd.DataFrame({'input':pd.Series(trainer.eval_dataset['input']), 'preds':preds})

In [None]:
preds_df

Unnamed: 0,input,preds
0,TEXT1: somalia - lower shebelle - merca; TEXT2...,0.0
1,TEXT1: niger - tahoua - tahoua; TEXT2: niger -...,1.0
2,TEXT1: south sudan - bahr el ghazal - northern...,0.0
3,TEXT1: kenya - marsabit - maikona; TEXT2: keny...,0.0
4,TEXT1: chad - kanem - kanem; TEXT2: chad - kan...,1.0
...,...,...
281,TEXT1: bangladesh - rajshahi - sirajgani; TEXT...,1.0
282,TEXT1: kenya - mandera - central; TEXT2: kenya...,0.0
283,TEXT1: sudan - darfur - west darfur; TEXT2: su...,0.0
284,TEXT1: togo - maritime - golfe; TEXT2: togo - ...,1.0


In [None]:
# the sampled subset used for training and evaluation
data # -> no inference on the rest of the data... yet

Unnamed: 0,input,labels
2052,TEXT1: kenya - makueni - malili; TEXT2: kenya ...,0.0
2343,TEXT1: nepal - eastern - kosi; TEXT2: nepal - ...,1.0
3216,TEXT1: togo - maritime - golfe; TEXT2: togo - ...,1.0
1246,TEXT1: ethiopia - somali - liben; TEXT2: ethio...,1.0
545,TEXT1: cote divoire - woroba - bafing; TEXT2: ...,1.0
...,...,...
2749,TEXT1: somalia - lower shebelle - merca; TEXT2...,0.0
2038,TEXT1: kenya - makueni - nguu; TEXT2: kenya - ...,0.0
2460,TEXT1: pakistan - sindh - thatta; TEXT2: pakis...,0.0
3420,TEXT1: zimbabwe - mashonaland central - rushin...,1.0


In [None]:
# returning to full data
org_data = match_adm[['input','labels']].rename(columns = {'labels':'org_labels'})
eval_df = org_data#.iloc[org_data.index.difference(data.index)] # org_data - data
eval_ds = Dataset.from_pandas(eval_df).map(tok_func, batched=True)

In [None]:
preds = trainer.predict(eval_ds).predictions
preds = np.clip(preds, 0, 1) # clamp to [0, 1] (a hard clip, not a true sigmoid)

The following columns in the test set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: input, org_labels. If input, org_labels are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 3423
  Batch size = 256


In [None]:
# binarize the preds at 0.5
preds[preds >= .5] = 1
preds[preds < .5] = 0
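Clipping and a sigmoid are not the same thing, so it may help to see them side by side before the 0.5 cutoff is applied. The raw values below are invented, and the helper names are hypothetical.

```python
import math

def clip01(x: float) -> float:
    """Clamp x into [0, 1], as np.clip(preds, 0, 1) does elementwise."""
    return min(1.0, max(0.0, x))

def sigmoid(x: float) -> float:
    """Smoothly squash any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def binarize(x: float, threshold: float = 0.5) -> float:
    """Turn a score into a hard 0/1 label."""
    return 1.0 if x >= threshold else 0.0

raw = [-0.2, 0.3, 0.5, 1.3]  # hypothetical raw model outputs
print([clip01(x) for x in raw])
print([round(sigmoid(x), 3) for x in raw])
print([binarize(clip01(x)) for x in raw])
```

Clipping leaves in-range values untouched, while a sigmoid moves every value, so the same 0.5 cutoff would mean something different after a sigmoid.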

In [None]:
preds

array([[0.],
       [1.],
       [1.],
       ...,
       [1.],
       [1.],
       [1.]], dtype=float16)

In [None]:
eval_df['preds'] = preds

In [None]:
eval_df

Unnamed: 0,input,org_labels,preds
0,TEXT1: afghanistan - daykundi - ishtarlay; TEX...,0.0,0.0
1,TEXT1: afghanistan - laghman - mihtariam; TEXT...,1.0,1.0
2,TEXT1: afghanistan - laghman - qarghayi; TEXT2...,1.0,1.0
3,TEXT1: afghanistan - nangarhar - behsud; TEXT2...,0.0,0.0
4,TEXT1: afghanistan - nangarhar - kama rodat; T...,1.0,1.0
...,...,...,...
3418,TEXT1: zimbabwe - mashonaland central - rushin...,1.0,1.0
3419,TEXT1: zimbabwe - mashonaland central - rushin...,1.0,1.0
3420,TEXT1: zimbabwe - mashonaland central - rushin...,1.0,1.0
3421,TEXT1: zimbabwe - mashonaland central - rushin...,1.0,1.0


In [None]:
# 0 = not the same
# 1 = possibly the same (if diff, because misspelled)
p = eval_df.loc[eval_df.org_labels != eval_df.preds].sample(5).apply(lambda x: print(x['input'], x['org_labels'], x['preds'], '\n'),axis=1)

TEXT1: kenya - turkana - central; TEXT2: kenya - turkana - turkana central 0.0 1.0 

TEXT1: central african republic - lobaye - bimbo; TEXT2: central african republic - lobaye - boda 1.0 0.0 

TEXT1: guatemala - chiquimula - rabinal; TEXT2: guatemala - chiquimula - ipala 1.0 0.0 

TEXT1: guatemala - quiche - gualan; TEXT2: guatemala - quiché - cunén 0.0 1.0 

TEXT1: cote divoire - zanzan - gontougo; TEXT2: côte d'ivoire - zanzan - gontougo 0.0 1.0 



In [None]:
eval_df.loc[eval_df.org_labels != eval_df.preds]

Unnamed: 0,input,org_labels,preds
6,TEXT1: afghanistan - nangarhar - nur; TEXT2: a...,1.0,0.0
7,TEXT1: afghanistan - nangarhar - nur; TEXT2: a...,1.0,0.0
9,TEXT1: afghanistan - ghor - shahrak tula; TEXT...,1.0,0.0
12,TEXT1: afghanistan - bamyan - sayghan; TEXT2: ...,1.0,0.0
409,TEXT1: central african republic - lobaye - bim...,1.0,0.0
...,...,...,...
2862,TEXT1: somalia - galguduud - abudwaq; TEXT2: s...,1.0,0.0
2928,TEXT1: somalia - bari - bayla; TEXT2: somalia ...,0.0,1.0
2929,TEXT1: somalia - bari - bayla; TEXT2: somalia ...,0.0,1.0
2930,TEXT1: somalia - bari - bayla; TEXT2: somalia ...,0.0,1.0
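Beyond eyeballing the mismatches, it can be useful to count how often the fuzzy-matching labels and the model agree, and in which direction they disagree. This is a pandas-free sketch on invented values; `confusion_counts` is a hypothetical helper.

```python
from collections import Counter

def confusion_counts(labels, preds):
    """Count (label, pred) pairs, e.g. (1, 0) = labeled a match, predicted not."""
    return Counter((int(l), int(p)) for l, p in zip(labels, preds))

toy_labels = [0.0, 1.0, 1.0, 0.0, 1.0]  # hypothetical org_labels
toy_preds = [0.0, 1.0, 0.0, 1.0, 1.0]   # hypothetical preds
counts = confusion_counts(toy_labels, toy_preds)
agreement = (counts[(0, 0)] + counts[(1, 1)]) / len(toy_labels)
print(counts, agreement)
```

The `(0, 1)` bucket is the interesting one here: pairs the fuzzy score rejected but the model thinks are the same place.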


## Next steps:

*   Train secondary model
*   Implement sigmoid to increase accuracy of preds
*   Try to use historical data
*   Further data cleaning/pre-processing (ex. abbreviations of directions)
*   Hyperparameter tuning
*   Separate model for single admin case
*   Predict actual correct spelling 
*   Try to find complete misses

## Getting corrections

In [None]:
spelled_wrong = match_adm.loc[(match_adm.labels == 1)&(match_adm.score_sort < 100)] # same place as MATCHED_ADM but spelled wrong
spelled_wrong

Unnamed: 0.1,Unnamed: 0,target_adm,SURVEYID,MATCHED_ADM,score_sort,score_partial,ADM_012,GID_2,input,labels,corrected
1,0,afghanistan - laghman - mihtariam,afgh39.csv,afghanistan - laghman - mihtarlam,97,205,afghanistan - laghman - mihtarlam,AFG.20.4_1,TEXT1: afghanistan - laghman - mihtariam; TEXT...,1.0,afghanistan - laghman - mihtarlam
4,0,afghanistan - nangarhar - kama rodat,afgh40.csv,afghanistan - nangarhar - rodat,92,228,afghanistan - nangarhar - rodat,AFG.22.17_1,TEXT1: afghanistan - nangarhar - kama rodat; T...,1.0,afghanistan - nangarhar - rodat
5,0,afghanistan - nangarhar - dara,afgh40.csv,afghanistan - nangarhar - kama,92,221,afghanistan - nangarhar - kama,AFG.22.10_1,TEXT1: afghanistan - nangarhar - dara; TEXT2: ...,1.0,afghanistan - nangarhar - kama
6,0,afghanistan - nangarhar - nur,afgh40.csv,afghanistan - nangarhar - lal pur,89,224,afghanistan - nangarhar - lal pur,AFG.22.13_1,TEXT1: afghanistan - nangarhar - nur; TEXT2: a...,1.0,afghanistan - nangarhar - lal pur
7,1,afghanistan - nangarhar - nur,afgh40.csv,afghanistan - nangarhar - shinwar,89,230,afghanistan - nangarhar - shinwar,AFG.22.19_1,TEXT1: afghanistan - nangarhar - nur; TEXT2: a...,1.0,afghanistan - nangarhar - shinwar
...,...,...,...,...,...,...,...,...,...,...,...
3060,3,sudan - darfur - south darfur,sudn97.csv,sudan - south darfur - buram,86,307990,sudan - south darfur - buram,SDN.14.1_1,TEXT1: sudan - darfur - south darfur; TEXT2: s...,1.0,sudan - south darfur - buram
3061,6,sudan - darfur - south darfur,sudn97.csv,sudan - south darfur - buram,86,307991,sudan - south darfur - buram,SDN.14.1_1,TEXT1: sudan - darfur - south darfur; TEXT2: s...,1.0,sudan - south darfur - buram
3062,0,sudan - darfur - south darfur,sudn99.csv,sudan - south darfur - buram,86,307989,sudan - south darfur - buram,SDN.14.1_1,TEXT1: sudan - darfur - south darfur; TEXT2: s...,1.0,sudan - south darfur - buram
3063,3,sudan - darfur - south darfur,sudn99.csv,sudan - south darfur - buram,86,307990,sudan - south darfur - buram,SDN.14.1_1,TEXT1: sudan - darfur - south darfur; TEXT2: s...,1.0,sudan - south darfur - buram


In [None]:
is_misspelled = match_adm.index.isin(spelled_wrong.index) # rows flagged above as misspelled
match_adm['corrected'] = match_adm['MATCHED_ADM'].where(is_misspelled, match_adm['target_adm']) # use the matched name only for flagged rows
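The correction rule boils down to one condition per row: take the matched name only when the score is above the label threshold but below a perfect 100. Here is a pandas-free sketch of that rule; `corrected_name` is a hypothetical helper, and the threshold of 85 is the one used for the labels earlier.

```python
def corrected_name(target_adm: str, matched_adm: str, score_sort: int, threshold: int = 85) -> str:
    """Use the matched name for likely misspellings, otherwise keep the original."""
    is_misspelled = threshold < score_sort < 100
    return matched_adm if is_misspelled else target_adm

# rows taken from the table above
print(corrected_name("afghanistan - laghman - mihtariam", "afghanistan - laghman - mihtarlam", 97))
print(corrected_name("afghanistan - daykundi - ishtarlay", "afghanistan - daykundi - shahristan", 85))
```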

In [None]:
match_adm

Unnamed: 0.1,Unnamed: 0,target_adm,SURVEYID,MATCHED_ADM,score_sort,score_partial,ADM_012,GID_2,input,labels,corrected
0,0,afghanistan - daykundi - ishtarlay,afgh38.csv,afghanistan - daykundi - shahristan,85,54,afghanistan - daykundi - shahristan,AFG.6.4_1,TEXT1: afghanistan - daykundi - ishtarlay; TEX...,0.0,afghanistan - daykundi - ishtarlay
1,0,afghanistan - laghman - mihtariam,afgh39.csv,afghanistan - laghman - mihtarlam,97,205,afghanistan - laghman - mihtarlam,AFG.20.4_1,TEXT1: afghanistan - laghman - mihtariam; TEXT...,1.0,afghanistan - laghman - mihtarlam
2,0,afghanistan - laghman - qarghayi,afgh39.csv,afghanistan - laghman - qarghayi,100,206,afghanistan - laghman - qarghayi,AFG.20.5_1,TEXT1: afghanistan - laghman - qarghayi; TEXT2...,1.0,afghanistan - laghman - qarghayi
3,0,afghanistan - nangarhar - behsud,afgh40.csv,afghanistan - nangarhar - hisarak,84,219,afghanistan - nangarhar - hisarak,AFG.22.8_1,TEXT1: afghanistan - nangarhar - behsud; TEXT2...,0.0,afghanistan - nangarhar - behsud
4,0,afghanistan - nangarhar - kama rodat,afgh40.csv,afghanistan - nangarhar - rodat,92,228,afghanistan - nangarhar - rodat,AFG.22.17_1,TEXT1: afghanistan - nangarhar - kama rodat; T...,1.0,afghanistan - nangarhar - rodat
...,...,...,...,...,...,...,...,...,...,...,...
3418,0,zimbabwe - mashonaland central - rushinga,zimb01.csv,zimbabwe - mashonaland central - rushinga,100,355093,zimbabwe - mashonaland central - rushinga,ZWE.4.9_2,TEXT1: zimbabwe - mashonaland central - rushin...,1.0,zimbabwe - mashonaland central - rushinga
3419,25,zimbabwe - mashonaland central - rushinga,zimb01.csv,zimbabwe - mashonaland central - rushinga,100,355094,zimbabwe - mashonaland central - rushinga,ZWE.4.9_2,TEXT1: zimbabwe - mashonaland central - rushin...,1.0,zimbabwe - mashonaland central - rushinga
3420,50,zimbabwe - mashonaland central - rushinga,zimb01.csv,zimbabwe - mashonaland central - rushinga,100,355095,zimbabwe - mashonaland central - rushinga,ZWE.4.9_2,TEXT1: zimbabwe - mashonaland central - rushin...,1.0,zimbabwe - mashonaland central - rushinga
3421,75,zimbabwe - mashonaland central - rushinga,zimb01.csv,zimbabwe - mashonaland central - rushinga,100,355096,zimbabwe - mashonaland central - rushinga,ZWE.4.9_2,TEXT1: zimbabwe - mashonaland central - rushin...,1.0,zimbabwe - mashonaland central - rushinga


In [None]:
match_adm.to_csv('match_adm_corrected_12_19.csv')