## Data Preprocessing - Redone - v5
- Stop removing punctuation – the tokenizer already handles this step.
- Welcome to Mobile WACh – Some messages still display this salutation – to be removed.
- I suggest you exclude all control participants. These will have variable Study Group == “control” in the participant database. If you include these messages they will be 99% the welcome message.
- We have a set of automated event-based “bounce” messages that the system sends when one of the following conditions is met:
    - weekend messages get a message that nurses are currently unavailable, 
    - control messages get a message that they won’t get a reply, 
    - people who have stopped messages get a message that they’re no longer active etc. 
  I’ve saved a copy of these messages in OneDrive, here. I think it may make sense to exclude these messages from your analysis, since they don’t really have any content and are triggered by the context in which the preceding participant message was sent. Let me know what you both think about this.
- In this study, the messaging schedule is a little different. Participants receive weekly system messages until 38 weeks estimated gestation. Then 38-40 weeks they receive daily system messages. The first 2 weeks after delivery they receive 2 messages per day. Then every 2 days until 6 weeks. (See the attached paper). In the pilot dataset you used to develop the models in your paper, the schedule was: weekly until 39 weeks, extra message 3 days before delivery, once daily for 2 weeks, every 2 days until 12 weeks. Although the system messages are more frequent right before and after delivery than in the previous dataset, I don’t think this should require any change to our approach of prepending the most recent system message. It’s possible that, particularly when women are sent 2 messages per day, they may be replying to messages earlier than the immediately preceding message. But I don’t think at this point we should change our context approach. I just wanted you to be aware of how the message numbers and timing differ from the previous study.
- Regarding the problem with some mothers’ names being regular words that have meaning (e.g. “baby”): did you all come up with any strategies to deal with this? Maybe we should only remove these words from the opening sentence of system messages (where the word is inserted automatically) and leave the rest intact? I imagine participants don’t often use their own name in messages, and we’re not using the nurse messages in our fine-tuning. I think removing real words could be a problem. Tal, do you have thoughts?

- Context generation
    - We have seen that after ordering by participant_id and created, system messages are not always first. 
    - Rewrite the context generation function to group messages by participant first, then assign each message the most recent system message. If a participant message is not preceded by a system message, it stands alone.
    - Add the participant ID to the system dataframe as an additional check when merging, to avoid prepending a system message to the wrong participant.

- Other Activities
    - Show number of messages where modified is different from created
- “WEEKEND:”, “NIGHT:” and “WARNING: CONTROL:” are prepended to incoming messages that meet certain conditions to help nurses. WEEKEND is messages sent over the weekend, outside of nurses’ working hours. NIGHT is messages sent at night. WARNING: CONTROL is messages from control participants. 
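
The "Other Activities" item about messages whose `modified` timestamp differs from `created` has no cell in this notebook. A minimal sketch of one way to count them is shown here, assuming both columns parse cleanly as timestamps; the toy rows are made up for illustration.

```python
import pandas as pd

# Hypothetical toy rows; in the notebook the columns come from messages_df.
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "created": ["2022-01-01 10:00:00", "2022-01-02 09:30:00", "2022-01-03 18:45:00"],
    "modified": ["2022-01-01 10:00:00", "2022-01-02 11:00:00", "2022-01-03 18:45:00"],
})

# Parse both columns so the comparison is on timestamps, not raw strings.
created = pd.to_datetime(toy["created"])
modified = pd.to_datetime(toy["modified"])

# Count the rows where the two timestamps disagree.
n_modified = int((created != modified).sum())
print(n_modified)
```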


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import sys

#pd.set_option("display.max_rows", 10)

In [2]:
messages_df = pd.read_csv("../NEO-RCT-labeled/mwach-dump-2022-04-05/mwbase_message-2022-04-05.csv")
participant =  pd.read_csv("../NEO-RCT-labeled/mwach-dump-2022-04-05/mwneo_participant-2022-04-05.csv")

#Remove messages with no participants -- spam
messages_df = messages_df[messages_df.participant_id.notnull()]

#convert participant_id into int64 -- it's read in as float.
messages_df['participant_id'] = messages_df.participant_id.astype('int64')

#Sort by participant and date
messages_df = messages_df.sort_values(by=['participant_id', 'created'], ascending=[True, True])

messages_df.shape


(169106, 26)

In [3]:
#I suggest you exclude all control participants. 
#These will have variable Study Group == “control” in the participant database. 
#If you include these messages they will be 99% the welcome message.


participant_df = participant[["ID", "Study Group", "SMS Name", "SMS Messaging Status", "Status"]]
#rename to participant_id
participant_df = participant_df.rename(columns={"ID":"participant_id"})

#merge messages with participant information
messages_df = pd.merge(messages_df, participant_df, how='left', left_on="participant_id", right_on="participant_id")

#filter out controls
messages_df = messages_df[messages_df['Study Group'] != "control"]
messages_df.shape
#Check participants that did not receive messages
#diff = messages_df.participant_id.values - participant_df.participant_id.values
#diff

(165935, 30)
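
A hedged sketch of the same control filter on toy data follows. It assumes each participant appears once in the participant table (enforced with `validate="many_to_one"`) and uses `indicator=True` to surface messages with no participant record. The group name "two-way" and all IDs are hypothetical.

```python
import pandas as pd

# Hypothetical toy tables; "two-way" is a made-up study group name.
messages = pd.DataFrame({"ID": [1, 2, 3, 4], "participant_id": [87, 87, 90, 95]})
participants = pd.DataFrame({
    "participant_id": [87, 90],
    "Study Group": ["two-way", "control"],
})

# validate raises if a participant_id is duplicated on the right side;
# indicator adds a _merge column showing which rows found a match.
merged = messages.merge(
    participants, how="left", on="participant_id",
    validate="many_to_one", indicator=True,
)

# Same filter as the notebook: drop messages from control participants.
non_control = merged[merged["Study Group"] != "control"]

# Messages without a participant record have a missing Study Group,
# so they pass the filter above; list them separately for review.
unmatched = merged[merged["_merge"] == "left_only"]
print(non_control["ID"].tolist(), unmatched["ID"].tolist())
```

In this sketch an unmatched message is kept by the `!= "control"` filter, which may or may not be the desired behaviour for the real data.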

In [4]:
### Generate sent_by and label variables
messages_df['sent_by'] = np.where(messages_df.System == 1, 'system', 
                                  np.where((messages_df.Out == 1) & (messages_df.System == 0), 'nurse', 'participant'))

#rename urgency column
messages_df = messages_df.rename(columns={'Message Urgency':'Urgency'})

#create label variable
messages_df['label'] = np.where(((messages_df.Urgency == 1) | (messages_df.Urgency == 2)), 1,
                                  np.where((messages_df.Urgency == 3) | (messages_df.Urgency == 4) | (messages_df.Urgency == 5), 0, -2))
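
The nested `np.where` above maps urgency 1 and 2 to label 1, urgency 3 to 5 to label 0, and everything else to -2 (unlabeled). An equivalent formulation with `np.select` is sketched here on a toy series; it is an alternative spelling, not a change to the mapping.

```python
import numpy as np
import pandas as pd

# Toy urgency values; -2 stands for messages without an urgency rating.
urgency = pd.Series([1, 2, 3, 4, 5, -2])

# Conditions are checked in order; rows matching none get the default.
label = np.select(
    [urgency.isin([1, 2]).to_numpy(), urgency.isin([3, 4, 5]).to_numpy()],
    [1, 0],
    default=-2,
)
print(label.tolist())
```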


In [5]:
maindf_short = messages_df[['ID','created','Out','System','languages','participant_id', 'Urgency','sent_by', 'text', 'label']]
#maindf_short = maindf_short.reset_index()
#info
maindf_short.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 165935 entries, 0 to 169104
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   ID              165935 non-null  int64 
 1   created         165935 non-null  object
 2   Out             165935 non-null  int64 
 3   System          165935 non-null  int64 
 4   languages       75515 non-null   object
 5   participant_id  165935 non-null  int64 
 6   Urgency         165935 non-null  int64 
 7   sent_by         165935 non-null  object
 8   text            165935 non-null  object
 9   label           165935 non-null  int32 
dtypes: int32(1), int64(5), object(4)
memory usage: 13.3+ MB


In [6]:
import re
def removeBounceSalutation(text):
    ''' #Remove salutation '''
    sent = str(text)
    result = re.match(r"{name}, this is {nurse} from {clinic}.", sent)
    result2 = re.match(r"Welcome to Mobile WACh NEO!", sent)
    result_swahili = re.match(r"{name}, huyu ni {nurse} kutoka {clinic}.", sent)
    result_luo = re.match(r"{name}, mae en {nurse} mawuok {clinic}.", sent)
    result_luo2 = re.match(r"{name}, ma en {nurse} mawuok {clinic}.", sent)
    
    #removing Asante kwa ujumbe wako {name}.
    result3 = re.match(r"Thank you for your\xa0message, {name}.", sent)
    result_swahili2 = re.match(r"Asante kwa ujumbe wako {name}.", sent)
    result_luo3 = re.match(r"Erokamano kuom oteni {name}.", sent)
    

    if(result):
        sentence = sent.split(str(result.group()),1)[1]
    elif(result_swahili):
        sentence = sent.split(str(result_swahili.group()),1)[1]
    elif(result_luo):
        sentence = sent.split(str(result_luo.group()),1)[1]
    elif(result_luo2):
        sentence = sent.split(str(result_luo2.group()),1)[1]
        
    elif(result2):
        sentence = sent.split(str(result2.group()),1)[1]
    elif(result3):
        sentence = sent.split(str(result3.group()),1)[1]
        
    elif(result_swahili2):
        sentence = sent.split(str(result_swahili2.group()),1)[1]
    elif(result_luo3):
        sentence = sent.split(str(result_luo3.group()),1)[1]
        
    else:
        sentence = sent

    return sentence.strip()

In [7]:
#Remove "bounce" messages - Load as csv.
bounce_messages = pd.read_csv("Data/BounceMessages.csv")
bounce_msgs = bounce_messages[~bounce_messages.Group.isna()]
bounce_msgs.English.values

array(["{name}, this is {nurse} from {clinic}. Welcome to Mobile WACh NEO! Thank you for joining the study! We would like you to inform us by SMS, flash or phone call when you have delivered your baby and if you have any medical emergencies. We ask that you attend the 2- and 6-week follow-up visits as explained to you at enrolment. Please call or text us with any concerns about your participation.\nYou will be receiving weekly and then daily SMS to guide you during pregnancy and the first months of your baby's life. Please text to ask any questions and share health concerns in your pregnancy and after your baby is born.   A nurse is available Monday to Friday during working hours.  We are here for you!  Please text when you deliver! ",
       'Welcome to Mobile WACh NEO! Thank you for joining the study! We would like you to inform us by SMS, flash or phone call when you have delivered your baby and if you have any medical emergencies. We ask that you attend the 2- and 6-week follow-up 

In [8]:
#bounce_msgs.iloc[0,[5,6,7]].values
bounce_msgs = bounce_msgs.assign(English = bounce_msgs.English.apply(lambda text: removeBounceSalutation(text)))
bounce_msgs = bounce_msgs.assign(English = bounce_msgs.English.apply(lambda text: removeBounceSalutation(text))) #to remove second sentence - welcome to mWACh NEO!
bounce_msgs = bounce_msgs.assign(Swahili = bounce_msgs.Swahili.apply(lambda text: removeBounceSalutation(text)))
bounce_msgs = bounce_msgs.assign(Luo = bounce_msgs.Luo.apply(lambda text: removeBounceSalutation(text)))

bounce_msgs.English.values

array(["Thank you for joining the study! We would like you to inform us by SMS, flash or phone call when you have delivered your baby and if you have any medical emergencies. We ask that you attend the 2- and 6-week follow-up visits as explained to you at enrolment. Please call or text us with any concerns about your participation.\nYou will be receiving weekly and then daily SMS to guide you during pregnancy and the first months of your baby's life. Please text to ask any questions and share health concerns in your pregnancy and after your baby is born.   A nurse is available Monday to Friday during working hours.  We are here for you!  Please text when you deliver!",
       'Thank you for joining the study! We would like you to inform us by SMS, flash or phone call when you have delivered your baby and if you have any medical emergencies. We ask that you attend the 2- and 6-week follow-up visits as explained to you at enrolment. Please call or text us with any concerns about your par

In [9]:
messages_df['Urgency'].value_counts(dropna=False)

-2    156283
 5      3682
 3      2334
 2      1786
 4      1399
 1       451
Name: Urgency, dtype: int64

In [10]:
messages_df['System'].value_counts(dropna=False)

1    90353
0    75582
Name: System, dtype: int64

In [11]:
messages_df.Out.value_counts(dropna=False)

1    119632
0     46303
Name: Out, dtype: int64

In [12]:
messages_df.sent_by.value_counts(dropna=False)

system         90353
participant    46303
nurse          29279
Name: sent_by, dtype: int64

In [13]:
messages_df.label.value_counts(dropna=False)

-2    156283
 0      7415
 1      2237
Name: label, dtype: int64

In [14]:
#Save raw dataset
#messages_final.to_csv("Data/messages_df.csv")

### Data Cleaning


In [15]:
import re
def removeSalutation(text):
    ''' #Remove salutation '''
    sent = str(text)
    result = re.match(r".+, this is \b[A-Z][a-z]+ from (\w+)\.", sent)
    result_swahili = re.match(r".+, huyu ni \b[A-Z][a-z]+ kutoka (\w+)\.", sent)
    result_luo = re.match(r".+, mae en \b[A-Z][a-z]+ mawuok (\w+)\.", sent)
    result_luo2 = re.match(r".+, ma en \b[A-Z][a-z]+ mawuok (\w+)\.", sent)
    
    result_welcome = re.match(r".+Welcome to Mobile WACh.", sent)
    result_welcome2 = re.match(r".+Welcome to Mobile WACh NEO.", sent)
    
    result_karibu = re.match(r".+Karibu kwa Mobile WACh.", sent)
    result_karibu2 = re.match(r"Karibu kwa Mobile WACh NEO.", sent)
    result_karibu3 = re.match(r" Karibu kwa Mobile WACh NEO.", sent)
    #mobile WACh NEO
    result_wachneo = re.match(r".+ mobile WACh (\w+)\.", sent)
    result_wachcaps = re.match(r".+ Mobile WACh (\w+)\.", sent)
    result_wachcaps2 = re.match(r".+ Mobile WACh NEO!", sent)
    

    if(result):
        sentence = sent.split(str(result.group()),1)[1]
    elif(result_swahili):
        sentence = sent.split(str(result_swahili.group()),1)[1]
        
    elif(result_luo):
        sentence = sent.split(str(result_luo.group()),1)[1]
    elif(result_luo2):
        sentence = sent.split(str(result_luo2.group()),1)[1] 
        
    elif(result_wachneo):
        sentence = sent.split(str(result_wachneo.group()),1)[1]
    elif(result_welcome):
        sentence = sent.split(str(result_welcome.group()),1)[1]
        
    
    elif(result_karibu2):
        sentence = sent.split(str(result_karibu2.group()),1)[1]
    elif(result_karibu3):
        sentence = sent.split(str(result_karibu3.group()),1)[1]
    elif(result_karibu):
        sentence = sent.split(str(result_karibu.group()),1)[1]
        
    elif(result_wachcaps):
        sentence = sent.split(str(result_wachcaps.group()),1)[1]
    elif(result_wachcaps2):
        sentence = sent.split(str(result_wachcaps2.group()),1)[1]
    else:
        sentence = sent

    return sentence.strip()
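
The chain of `elif` branches above could also be driven by a list of compiled patterns. The sketch below covers only three of the salutations, with slightly simplified patterns, so it illustrates the structure and is not a drop-in replacement. `SALUTATION_PATTERNS` and `strip_salutations` are hypothetical names.

```python
import re

# Simplified versions of three patterns from the cell above (assumption:
# the remaining Luo and Karibu variants would be added the same way).
SALUTATION_PATTERNS = [
    re.compile(r".+?, this is [A-Z][a-z]+ from \w+\."),
    re.compile(r".+?, huyu ni [A-Z][a-z]+ kutoka \w+\."),
    re.compile(r".*?Welcome to Mobile WACh(?: NEO)?[.!]"),
]

def strip_salutations(text):
    """Remove leading salutations repeatedly until no pattern matches."""
    sent = str(text).strip()
    changed = True
    while changed:
        changed = False
        for pattern in SALUTATION_PATTERNS:
            match = pattern.match(sent)
            if match:
                # Drop the matched salutation and any whitespace after it.
                sent = sent[match.end():].strip()
                changed = True
    return sent

example = "Mary, this is Jane from Clinic. Welcome to Mobile WACh NEO! Thank you for joining."
print(strip_salutations(example))
```

Looping until nothing matches removes the need to call the function twice to catch a second salutation sentence.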

In [16]:
#1. remove salutations
text_column = "text"
maindf = messages_df
maindf = maindf.assign(text = maindf.text.apply(lambda text: removeSalutation(text)))
#check for second sentences
maindf = maindf.assign(text = maindf.text.apply(lambda text: removeSalutation(text))) #Welcome to mWACh NEO!


In [17]:
#2. Search and remove bounce messages
def cleanMainDf(mid, text):
    #check if this text contains any bounce template (English, Swahili or Luo)
    english = bounce_msgs.English.apply(lambda substr: checkMessage(substr, text))
    swahili = bounce_msgs.Swahili.apply(lambda substr: checkMessage(substr, text))
    luo = bounce_msgs.Luo.apply(lambda substr: checkMessage(substr, text))
    
    return bool(any(english) or any(swahili) or any(luo))
    
    
def checkMessage(substr, text):
    #True when the bounce template occurs anywhere in the message text
    return substr in text
    
    
maindf['bounce'] = maindf.apply(lambda row: cleanMainDf(row.ID, row.text), axis=1)
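
One thing worth checking: `removeBounceSalutation` converts its input with `str()`, so if any Swahili or Luo template cell is missing it becomes the string "nan", and the substring test would then flag every message containing those letters. The sketch below shows one way to guard against that by collecting only non-missing templates. The helper names and toy templates are hypothetical.

```python
import pandas as pd

# Hypothetical templates; None marks a translation missing from the sheet.
templates_df = pd.DataFrame({
    "English": ["Nurses are currently unavailable", "You are no longer active"],
    "Swahili": ["placeholder swahili template", None],
})

def collect_templates(df, columns):
    """Return every non-missing, non-empty template string from the columns."""
    templates = []
    for column in columns:
        for value in df[column].dropna():
            value = str(value).strip()
            if value:
                templates.append(value)
    return templates

def is_bounce(text, templates):
    """True when any template occurs as a substring of the message."""
    text = str(text)
    return any(template in text for template in templates)

templates = collect_templates(templates_df, ["English", "Swahili"])

# Made-up messages; the second one contains the letters "nan".
messages = pd.Series([
    "WEEKEND: Nurses are currently unavailable until Monday",
    "nani atanisaidia?",
    "Thank you",
])
flags = messages.apply(lambda text: is_bounce(text, templates))
print(flags.tolist())
```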


In [18]:
maindf.bounce.value_counts()
#maindf.head()

False    161735
True       4200
Name: bounce, dtype: int64

In [19]:
#remove bounce messages; copy so later column assignments do not act on a slice
maindf = maindf[~maindf.bounce].copy()
maindf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 161735 entries, 0 to 169104
Data columns (total 33 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   ID                     161735 non-null  int64  
 1   created                161735 non-null  object 
 2   modified               161735 non-null  object 
 3   text                   161735 non-null  object 
 4   Out                    161735 non-null  int64  
 5   System                 161735 non-null  int64  
 6   Viewed                 161735 non-null  int64  
 7   is related             46287 non-null   float64
 8   parent_id              27755 non-null   float64
 9   action time            46284 non-null   object 
 10  translated text        77154 non-null   object 
 11  Translated             161735 non-null  object 
 12  translation time       57725 non-null   object 
 13  languages              75469 non-null   object 
 14  admin_user_id          29233 non-nul

In [20]:
#3. Remove control keywords
import re
controls_keywords = ["WEEKEND:", "NIGHT:", "WARNING: CLIENT EXITED FROM STUDY:"]
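#regex=True is required for the '|' alternation; the default became regex=False in pandas 2.0
maindf['text'] = maindf.text.str.replace('|'.join(map(re.escape, controls_keywords)), '', regex=True)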
msgs_with_controls = maindf[maindf.text.str.contains('|'.join(map(re.escape, controls_keywords)))] #'|'.join(controls_keywords) 
msgs_with_controls

Unnamed: 0,ID,created,modified,text,Out,System,Viewed,is related,parent_id,action time,...,Urgency,last system message,Replied,Study Group,SMS Name,SMS Messaging Status,Status,sent_by,label,bounce
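
The cell above removes the keywords wherever they occur in the text. Since the notes describe them as prefixes prepended to incoming messages, a stricter variant that strips them only at the start of a message is sketched below. The prefix list here follows the notes ("WEEKEND:", "NIGHT:", "WARNING: CONTROL:") and is an assumption; the notebook's own list differs.

```python
import re
import pandas as pd

# Assumed prefix list taken from the notes, not from the notebook cell.
prefixes = ["WEEKEND:", "NIGHT:", "WARNING: CONTROL:"]

# One or more prefixes at the very start, each followed by optional whitespace.
pattern = r"^\s*(?:(?:" + "|".join(map(re.escape, prefixes)) + r")\s*)+"

# Made-up messages for illustration.
texts = pd.Series([
    "WEEKEND: my baby has a fever",
    "WEEKEND: NIGHT: hello",
    "Good NIGHT: sleep well",
    "WARNING: CONTROL: hi",
])

cleaned = texts.str.replace(pattern, "", regex=True)
print(cleaned.tolist())
```

Anchoring at the start leaves a message such as "Good NIGHT: sleep well" untouched, where the unanchored version would cut into participant text.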


In [21]:
#3. Calculate context
def generateContext(df):
    #work on a copy so the assignment below does not write into a slice of maindf
    df = df.copy()
    df['context_id'] = 0
    context_id = 0
    for row_index, row in df.iterrows():
        if(row.sent_by == "system"):
            context_id = row.ID
        df.at[row_index, 'context_id'] = context_id
    return df
        
maindf_grouped = maindf.groupby("participant_id")

#DataFrame.append was removed in pandas 2.0, so collect the groups and concatenate once
df_final = pd.concat([generateContext(group) for group_name, group in maindf_grouped])
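
The row-by-row loop can be slow on roughly 160,000 messages. A vectorized sketch of the same rule is shown below on toy data: each message gets the ID of the most recent system message for the same participant, or 0 when none precedes it. It assumes the rows are sorted by participant and creation time; the toy IDs and timestamps are made up.

```python
import pandas as pd

# Hypothetical toy messages for two participants.
toy = pd.DataFrame({
    "ID": [10, 11, 12, 13, 14, 20, 21, 22],
    "participant_id": [1, 1, 1, 1, 1, 2, 2, 2],
    "created": pd.to_datetime([
        "2022-01-01 08:00", "2022-01-01 09:00", "2022-01-01 10:00",
        "2022-01-01 11:00", "2022-01-01 12:00",
        "2022-01-02 08:00", "2022-01-02 09:00", "2022-01-02 10:00",
    ]),
    "sent_by": ["participant", "system", "participant", "nurse", "participant",
                "participant", "system", "participant"],
})

toy = toy.sort_values(["participant_id", "created"])

# Keep the ID only on system rows, then carry it forward within each participant.
system_ids = toy["ID"].where(toy["sent_by"] == "system")
toy["context_id"] = (
    system_ids.groupby(toy["participant_id"]).ffill().fillna(0).astype("int64")
)
print(toy[["ID", "participant_id", "sent_by", "context_id"]])
```

Because the forward fill runs inside each participant group, a system message can never become the context of another participant's message.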


In [26]:
#pd.set_option('display.max_rows', None)
df_final = df_final[['ID','created','participant_id', 'sent_by', 'context_id', 'languages', 'Urgency', 'label', 'text']]

#df_final.sample(100, random_state=10)[['participant_id']]

#df_final[df_final.context_id == 0]
df_final.to_csv("Data/maindf_v5.csv")

In [27]:
#4. System messages
system_df = df_final[df_final['sent_by'] == 'system']
system_df = system_df[["ID", "text", "participant_id"]]
system_df['system_text'] = system_df.text
system_df['system_participantID'] = system_df.participant_id
system_df = system_df[["ID", "system_text", "system_participantID"]]
system_df.head()

Unnamed: 0,ID,system_text,system_participantID
0,8391,You will be receiving weekly and then daily SM...,87
1,8392,Thank you for joining the study! You will rece...,87
4,8452,Be sure to come in for all your antenatal care...,87
5,8453,Be sure to come in for all your antenatal care...,87
13,8557,Sometimes pregnancy and mothehood can bring on...,87


In [76]:
#5. Participants messages - System Context
participant_df = df_final[df_final['sent_by'] == 'participant']
#Participants - add system context
part_context = pd.merge(participant_df, system_df, how="left", left_on="context_id", right_on="ID")
#dealing with all messages, including those without preceding context
part_context['contextualized'] = np.where(part_context.context_id == 0, part_context.text, part_context.system_text+" "+part_context.text)
participant_context = part_context[['ID_x', 'participant_id', 'context_id', 'contextualized', 'Urgency', 'label', 'system_participantID']]
#part_context.head()
participant_context.to_csv("Data/system_context_all_v5.csv")

In [24]:
#6. check that system partID and participant ID are matching
participant_context[participant_context["participant_id"] != participant_context["system_participantID"]]

Unnamed: 0,ID_x,participant_id,context_id,contextualized,label,system_participantID
489,9322,147,0,Thank you so much really appreciated for your ...,-2,
926,10553,166,0,hi there am marlyne i have lost my phone if yo...,-2,
927,12352,166,0,Good Morning,-2,
928,12354,166,0,Fine,-2,
929,12359,166,0,Okay Thanks,-2,
...,...,...,...,...,...,...
46290,173900,3838,0,thank you I appreciate your work,0,
46295,173886,3846,0,Thank you,0,
46296,173891,3847,0,Ok,0,
46297,174420,3857,0,THANK YOU.,0,
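
Every row listed above has `context_id` 0, so the mismatch appears to come from the missing system participant ID rather than from a wrong merge. A sketch of a narrower check, restricted to messages that actually have a context, is given here on made-up rows.

```python
import numpy as np
import pandas as pd

# Hypothetical rows in the layout of participant_context.
toy = pd.DataFrame({
    "ID_x": [1, 2, 3],
    "participant_id": [87, 87, 90],
    "context_id": [0, 8391, 8452],
    "system_participantID": [np.nan, 87.0, 87.0],
})

# Ignore stand-alone messages, which have no system message to compare against.
has_context = toy["context_id"] != 0
mismatched = toy[has_context & (toy["participant_id"] != toy["system_participantID"])]
print(mismatched["ID_x"].tolist())
```

An empty result on the real data would indicate that no participant message was paired with another participant's system message.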


In [25]:
#7. generate datasets
#rename instead of assigning new columns, which avoids writing into a slice
participant_context = participant_context.rename(columns={'ID_x': 'mid', 'contextualized': 'text'})
participant_context = participant_context[['mid', 'participant_id', 'text', 'label']]

#get unlabeled system_context messages
system_context_unlbl = participant_context[participant_context.label < 0]
system_context_unlbl.to_csv("Data/system_context_all_unlbl_v5.csv")

#get labeled system_context messages
system_context_lbl = participant_context[participant_context.label >= 0]
system_context_lbl.to_csv("Data/system_context_all_lbl_v5.csv")

#System Context train test split
#70% train; the remaining 30% is split 70/30 into test (about 21%) and dev (about 9%)
train_df, remain_df = train_test_split(system_context_lbl, random_state=42, train_size=0.7, stratify=system_context_lbl.label.values)
test_df, dev_df = train_test_split(remain_df, random_state=42, train_size=0.7, stratify=remain_df.label.values)

train_df.to_csv("Data/train_df_v5.csv")
test_df.to_csv("Data/test_df_v5.csv")
dev_df.to_csv("Data/dev_df_v5.csv")

#formulate system context pretraining dataset
system_context_pretrain_df = pd.concat([system_context_unlbl, train_df])
system_context_pretrain_df.to_csv("Data/system_context_pretrain_v5.csv")
system_pretrain_train, system_pretrain_validation = train_test_split(system_context_pretrain_df, test_size=0.2, random_state=42)
system_pretrain_train.text.to_csv("Data/system_pretrain_train_v5.csv")
system_pretrain_validation.text.to_csv("Data/system_pretrain_dev_v5.csv")


In [26]:
#8. Pretraining sets on labeled data
#remove test set from all labeled dataset
pretrain_lbl_df = system_context_lbl[~system_context_lbl.mid.isin(list(test_df.mid)) ]

system_pretrain_train_lbl, system_pretrain_dev_lbl = train_test_split(pretrain_lbl_df, test_size=0.2, random_state=42)

system_pretrain_train_lbl.text.to_csv("Data/system_pretrain_train_lbl_v5.csv")
system_pretrain_dev_lbl.text.to_csv("Data/system_pretrain_dev_lbl_v5.csv")


In [60]:
#9. Combining Pilot and R01 datasets -- ALL Dataset
#Training set
pretrain_train_pilot = pd.read_csv("Data/pretrain_train_pilot_all_v4.csv")
pretrain_train_pilot.drop(["Unnamed: 0"], axis=1, inplace=True)
combined_pretrain_train = pd.concat([system_pretrain_train, pretrain_train_pilot]).reset_index(drop=True)
combined_pretrain_train = combined_pretrain_train[["text"]]
combined_pretrain_train.to_csv("Data/combined_pretrain_train_v5.csv")

#Validation set
pretrain_dev_pilot = pd.read_csv("Data/pretrain_validation_pilot_all_v4.csv")
pretrain_dev_pilot.drop(["Unnamed: 0"], axis=1, inplace=True)
combined_pretrain_dev = pd.concat([system_pretrain_validation, pretrain_dev_pilot]).reset_index(drop=True)
combined_pretrain_dev = combined_pretrain_dev[["text"]]
combined_pretrain_dev.to_csv("Data/combined_pretrain_dev_v5.csv")
                             
combined_pretrain_train.shape, combined_pretrain_dev.shape

((50147, 1), (12538, 1))

In [61]:
#10. Combining Pilot and R01 datasets -- labeled Dataset
#Training set
pretrain_train_pilot_lbl = pd.read_csv("Data/pretrain_train_pilot_v4.csv")
pretrain_train_pilot_lbl.drop(["Unnamed: 0"], axis=1, inplace=True)
combined_pretrain_train_lbl = pd.concat([system_pretrain_train_lbl, pretrain_train_pilot_lbl]).reset_index(drop=True)
combined_pretrain_train_lbl = combined_pretrain_train_lbl[["text"]]
combined_pretrain_train_lbl.to_csv("Data/combined_pretrain_train_lbl_v5.csv")

#Validation set
pretrain_dev_pilot_lbl = pd.read_csv("Data/pretrain_validation_pilot_v4.csv")
pretrain_dev_pilot_lbl.drop(["Unnamed: 0"], axis=1, inplace=True)
combined_pretrain_dev_lbl = pd.concat([system_pretrain_dev_lbl, pretrain_dev_pilot_lbl]).reset_index(drop=True)
combined_pretrain_dev_lbl = combined_pretrain_dev_lbl[["text"]]
combined_pretrain_dev_lbl.to_csv("Data/combined_pretrain_dev_lbl_v5.csv")
                             
combined_pretrain_train_lbl.shape, combined_pretrain_dev_lbl.shape

((7281, 1), (1821, 1))

In [75]:
#11. Train, Dev, Test sets for combined fine tuning tasks
#v4 - datasets
train_df_v4 = pd.read_csv("Data/system_train_pilot_v4.csv")
train_df_v4 = train_df_v4[['text', 'label']]
test_df_v4 = pd.read_csv("Data/system_test_pilot_v4.csv")
test_df_v4 = test_df_v4[['text', 'label']]
dev_df_v4 = pd.read_csv("Data/system_dev_pilot_v4.csv")
dev_df_v4 = dev_df_v4[['text', 'label']]

train_df = train_df[['text', 'label']]
test_df = test_df[['text', 'label']]
dev_df = dev_df[['text', 'label']]

combined_train_df = pd.concat([train_df, train_df_v4]).reset_index(drop=True)
combined_test_df = pd.concat([test_df, test_df_v4]).reset_index(drop=True)
combined_dev_df = pd.concat([dev_df, dev_df_v4]).reset_index(drop=True)

combined_train_df.to_csv("Data/combined_train_df_v5.csv")
combined_test_df.to_csv("Data/combined_test_df_v5.csv")
combined_dev_df.to_csv("Data/combined_dev_df_v5.csv")

