<a href="https://colab.research.google.com/github/ymoslem/MT-Preparation/blob/main/extra/oversampling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Oversampling

In some toolkits, such as OpenNMT-{py,tf}, you can apply oversampling during training using the "dataset weights" feature. This notebook, however, explains how to apply it *manually* to datasets as part of data preparation, using *Pandas*.

In [None]:
import pandas as pd

Assume that `data` includes two domains. We will use the label *0* for the domain with less data, and the label *1* for the domain with more data. In this toy example, the first domain has 4 translation pairs, while the second domain has 16 translation pairs.

In [None]:
data = {
        "label": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        "source": [
                  "Sunny skies and gentle breeze.",
                  "Rainy with thunderstorms in evening.",
                  "Clear night with a full moon.",
                  "Cloudy and cool, chance of showers.",
                  "Students gather, eager to learn.",
                  "New books, fresh start, endless possibilities.",
                  "Friends reunite, laughter fills corridors.",
                  "Teachers inspire, minds come alive.",
                  "Busy hallways, buzzing with excitement.",
                  "Homework assignments, challenges to conquer.",
                  "Exploring new subjects, expanding knowledge.",
                  "Lunchtime chatter, delicious meals shared.",
                  "Sports teams practice, readying for competition.",
                  "Art projects, creativity unleashed on canvas.",
                  "Science experiments, discoveries waiting ahead.",
                  "Math problems solved, confidence grows.",
                  "Field trips, adventures beyond classroom.",
                  "Class discussions, diverse ideas shared.",
                  "Exam time, studying late nights.",
                  "Graduation nears, futures take shape."
  ],
        "target": [
                  "Ciel ensoleillé et légère brise.",
                  "Pluvieux avec des orages le soir.",
                  "Nuit claire avec une pleine lune.",
                  "Nuageux et frais, risque d'averses.",
                  "Les élèves se rassemblent, impatients d'apprendre.",
                  "Nouveaux livres, nouveau départ, possibilités infinies.",
                  "Les amis se retrouvent, les couloirs résonnent de rires.",
                  "Les enseignants inspirent, les esprits s'éveillent.",
                  "Couloirs animés, bourdonnant d'excitation.",
                  "Devoirs à faire, défis à relever.",
                  "Exploration de nouvelles matières, élargissement des connaissances.",
                  "Bavardages à l'heure du déjeuner, repas délicieux partagés.",
                  "Les équipes sportives s'entraînent, se préparant à la compétition.",
                  "Projets artistiques, créativité libérée sur la toile.",
                  "Expériences scientifiques, découvertes en attente.",
                  "Problèmes de maths résolus, confiance qui grandit.",
                  "Sorties scolaires, aventures au-delà de la salle de classe.",
                  "Discussions en classe, partage d'idées diverses.",
                  "Temps des examens, études jusqu'à tard dans la nuit.",
                  "La remise des diplômes approche, les futurs prennent forme."
  ]
}

In [None]:
df = pd.DataFrame(data)

In [None]:
print(df.shape)

df

(20, 3)


index,label,source,target
0,0,Sunny skies and gentle breeze.,Ciel ensoleillé et légère brise.
1,0,Rainy with thunderstorms in evening.,Pluvieux avec des orages le soir.
2,0,Clear night with a full moon.,Nuit claire avec une pleine lune.
3,0,"Cloudy and cool, chance of showers.","Nuageux et frais, risque d'averses."
4,1,"Students gather, eager to learn.","Les élèves se rassemblent, impatients d'appren..."
5,1,"New books, fresh start, endless possibilities.","Nouveaux livres, nouveau départ, possibilités ..."
6,1,"Friends reunite, laughter fills corridors.","Les amis se retrouvent, les couloirs résonnent..."
7,1,"Teachers inspire, minds come alive.","Les enseignants inspirent, les esprits s'éveil..."
8,1,"Busy hallways, buzzing with excitement.","Couloirs animés, bourdonnant d'excitation."
9,1,"Homework assignments, challenges to conquer.","Devoirs à faire, défis à relever."
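
Before oversampling, it can help to confirm how imbalanced the two domains are. The following is a minimal sketch, assuming a dataframe with the same `label` column as above; `toy_df` and its placeholder sentences are hypothetical stand-ins for `df`, used only to keep the example self-contained.

```python
import pandas as pd

# Hypothetical stand-in for `df`: 4 rows with label 0 and 16 rows with label 1
toy_df = pd.DataFrame({
    "label": [0] * 4 + [1] * 16,
    "source": [f"source sentence {i}" for i in range(20)],
    "target": [f"target sentence {i}" for i in range(20)],
})

# Number of translation pairs per domain label
counts = toy_df["label"].value_counts()
print(counts.to_dict())

# Ratio between the largest and the smallest domain
imbalance_ratio = counts.max() / counts.min()
print(imbalance_ratio)
```

With 4 pairs against 16, the ratio is 4, which is the gap that oversampling is meant to close.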


Now, let's randomly oversample the smaller domain (label *0*) to balance the data. Rows are drawn with replacement, so after oversampling the smaller domain has the same number of translation pairs as the larger domain (label *1*), and some pairs appear more than once.

In [None]:
# Oversampling the smaller dataset
# Reference: https://medium.com/analytics-vidhya/undersampling-and-oversampling-an-old-and-a-new-approach-4f984a0e8392


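def oversample(df):
  # Count rows per label; value_counts() sorts by frequency, largest class first
  classes = df.label.value_counts().to_dict()
  most = max(classes.values())
  # Split the dataframe into one sub-dataframe per label (largest class at index 0)
  classes_list = []
  for key in classes:
    classes_list.append(df[df['label'] == key])
  # Sample every other class with replacement up to the size of the largest class
  classes_sample = []
  for i in range(1,len(classes_list)):
    classes_sample.append(classes_list[i].sample(most, replace=True))
  # Oversampled classes come first, followed by the untouched largest class
  final_df = pd.concat(classes_sample + [classes_list[0]], axis=0)
  final_df = final_df.reset_index(drop=True)
  return final_df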
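
Because `sample()` draws rows at random, each run of the function above returns a different selection. If you need repeatable results, one option is to pass a fixed `random_state`. The following is a sketch of that idea rather than the notebook's own method: the name `oversample_seeded`, its `label_col` and `seed` parameters, and the placeholder `toy_df` are all assumptions made for illustration.

```python
import pandas as pd


def oversample_seeded(df, label_col="label", seed=42):
    # Size of the largest class: every smaller class is sampled up to this size
    most = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        if len(group) < most:
            # Draw `most` rows with replacement; the seed makes the draw repeatable
            group = group.sample(most, replace=True, random_state=seed)
        parts.append(group)
    return pd.concat(parts).reset_index(drop=True)


# Hypothetical stand-in for `df`: 4 rows with label 0 and 16 rows with label 1
toy_df = pd.DataFrame({
    "label": [0] * 4 + [1] * 16,
    "source": [f"source sentence {i}" for i in range(20)],
    "target": [f"target sentence {i}" for i in range(20)],
})

first = oversample_seeded(toy_df)
second = oversample_seeded(toy_df)
print(first.shape)
print(first.equals(second))
```

Two calls with the same seed return identical dataframes, which makes it easier to compare training runs on the same oversampled data.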

Compare the new dataframe with the original one, and notice how the data with the label *0* is now oversampled: each label has 16 rows, 32 in total. Congratulations, now you have a balanced dataset!

In [None]:
oversampled_df = oversample(df)

print(oversampled_df.shape)

oversampled_df

(32, 3)


index,label,source,target
0,0,Clear night with a full moon.,Nuit claire avec une pleine lune.
1,0,Rainy with thunderstorms in evening.,Pluvieux avec des orages le soir.
2,0,Clear night with a full moon.,Nuit claire avec une pleine lune.
3,0,Rainy with thunderstorms in evening.,Pluvieux avec des orages le soir.
4,0,Clear night with a full moon.,Nuit claire avec une pleine lune.
5,0,Rainy with thunderstorms in evening.,Pluvieux avec des orages le soir.
6,0,Rainy with thunderstorms in evening.,Pluvieux avec des orages le soir.
7,0,"Cloudy and cool, chance of showers.","Nuageux et frais, risque d'averses."
8,0,Sunny skies and gentle breeze.,Ciel ensoleillé et légère brise.
9,0,"Cloudy and cool, chance of showers.","Nuageux et frais, risque d'averses."
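
The output above is grouped by label, because the oversampled rows are concatenated ahead of the larger domain. If your training pipeline does not shuffle the data on its own, you may want to check the balance and shuffle the rows yourself. The following is a minimal sketch under that assumption; `balanced_df` is a hypothetical stand-in for `oversampled_df`, and the seed value is arbitrary.

```python
import pandas as pd

# Hypothetical stand-in for `oversampled_df`: label 0 rows first, then label 1
balanced_df = pd.DataFrame({
    "label": [0] * 16 + [1] * 16,
    "source": [f"source sentence {i}" for i in range(32)],
    "target": [f"target sentence {i}" for i in range(32)],
})

# All labels should now have the same number of rows
counts = balanced_df["label"].value_counts()
is_balanced = counts.nunique() == 1
print(is_balanced)

# Shuffle all rows so the two domains are interleaved
shuffled_df = balanced_df.sample(frac=1, random_state=0).reset_index(drop=True)
print(shuffled_df.shape)
```

Shuffling with `sample(frac=1)` keeps every row exactly once, so source and target sentences stay paired and the label counts are unchanged.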
