# Dialect Classification Data Preparation and Cleaning

This notebook demonstrates how to extract data from an SQLite database, split it into training and test sets, clean the text, upsample the minority classes, and save the processed data to CSV files. These steps prepare the data for training a dialect classification model.

## Step 1: Import Libraries

First, we import the libraries needed for database access and data manipulation.


### Why?<br>
sqlite3: To connect to the SQLite database and execute SQL queries.<br>
pandas: For data manipulation and analysis.<br>
### Benefit<br>
These libraries provide the essential tools needed to handle, process, and clean the data efficiently.<br>

In [1]:
import sqlite3
import pandas as pd

## Step 2: Extract Data from SQLite Database
We connect to the SQLite database and fetch the data from two tables: id_text and id_dialect. The data is then merged into a single DataFrame.

### Why?<br>
To retrieve the data stored in the SQLite database and combine it into a single DataFrame that includes both text and dialect information.<br>
### Benefit<br>
This step ensures that we have a unified dataset with all necessary information in one place, making it easier to process and analyze.<br>
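
As an alternative sketch, `pandas.read_sql_query` can load each table straight into a DataFrame, and `contextlib.closing` makes sure the connection is closed. The in-memory database and its sample rows are stand-ins for illustration, and `load_tables` is a hypothetical helper; only the table names and the column names `id`, `text`, and `dialect` come from this notebook.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def load_tables(conn):
    """Read id_text and id_dialect into DataFrames and merge them on 'id'."""
    id_text_df = pd.read_sql_query("SELECT * FROM id_text", conn)
    id_dialect_df = pd.read_sql_query("SELECT * FROM id_dialect", conn)
    # Name the columns the same way the notebook does
    id_text_df.columns = ["id", "text"]
    id_dialect_df.columns = ["id", "dialect"]
    return pd.merge(id_text_df, id_dialect_df, on="id")


# Hypothetical in-memory database standing in for Data/dialects_database.db
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE id_text (id INTEGER, text TEXT)")
    conn.execute("CREATE TABLE id_dialect (id INTEGER, dialect TEXT)")
    conn.executemany("INSERT INTO id_text VALUES (?, ?)", [(1, "first"), (2, "second")])
    conn.executemany("INSERT INTO id_dialect VALUES (?, ?)", [(1, "LY"), (2, "EG")])
    demo_df = load_tables(conn)

print(demo_df)
```

For the real database, the same helper would be called with a connection opened on the database file instead of `":memory:"`.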

In [2]:
# Connect to the database
conn = sqlite3.connect('Data/dialects_database.db')
cursor = conn.cursor()

In [3]:
# Fetch table names (for information purposes)
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = cursor.fetchall()
table_names = [table[0] for table in tables]

In [4]:
# Fetch all rows from the id_text and id_dialect tables
cursor.execute('SELECT * FROM id_text')
id_text = cursor.fetchall()
cursor.execute('SELECT * FROM id_dialect')
id_dialect = cursor.fetchall()

In [5]:
# Close the connection
cursor.close()
conn.close()

In [6]:
# Create DataFrames from the fetched data
id_text_df = pd.DataFrame(id_text, columns=['id', 'text'])
id_dialect_df = pd.DataFrame(id_dialect, columns=['id', 'dialect'])

# Merge the two DataFrames on the 'id' column
merged_df = pd.merge(id_text_df, id_dialect_df, on='id')
merged_df.head()

Unnamed: 0,id,text,dialect
0,1009754958479151232,@toha_Altomy @gy_yah قليلين ادب ومنافقين. لو ا...,LY
1,1009794751548313600,@AlmFaisal 😂😂 الليبيين متقلبين!!!\nبس بالنسبة ...,LY
2,1019989115490787200,@smsm071990 @ALMOGRBE كل 20 تانيه شاب ليبي بير...,LY
3,1035479791758135168,@AboryPro @lyranoo85 رانيا عقليتك متخلفة. اولا...,LY
4,1035481122921164800,@lyranoo85 شكلك متعقدة علشان الراجل لي تحبيه ا...,LY
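
`pd.merge` performs an inner join by default, so ids that appear in only one table are dropped without warning. A quick check, sketched here on made-up rows, uses the `validate` and `indicator` arguments to surface duplicated or unmatched ids before relying on the merged data.

```python
import pandas as pd

# Made-up rows for illustration; id 3 has no dialect label
texts = pd.DataFrame({"id": [1, 2, 3], "text": ["a", "b", "c"]})
labels = pd.DataFrame({"id": [1, 2], "dialect": ["LY", "EG"]})

# validate raises MergeError if an id repeats in either table;
# indicator adds a '_merge' column recording where each row came from
checked = pd.merge(texts, labels, on="id", how="outer",
                   validate="one_to_one", indicator=True)

# Rows that did not find a partner in the other table
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["id", "_merge"]])
```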


## Step 3: Save the Merged DataFrame
The merged DataFrame is saved to a CSV file for future use.

In [7]:
# Save the merged DataFrame to a CSV file
merged_df.to_csv('Data/merged_dataframe.csv', index=False, encoding='utf-8-sig')

## Step 4: Data Cleaning
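We remove the 'id' column, split the data into training and test sets, and clean the text with the `clean_text` function from the local `Functions` module, which removes diacritics, non-Arabic characters, extra spaces, and stopwords.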

### Why?<br>
Remove 'id' column: The 'id' column is not needed for text analysis and classification.<br>
Text Cleaning: To standardize the text by removing unnecessary characters and stopwords, which can improve the performance of the machine learning model.<br>
### Benefit<br>
This step ensures that the text data is in a clean and uniform format, which is crucial for accurate text classification.<br>
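
The body of `clean_text` is not shown in this notebook. The following is only a minimal sketch of what such a function might look like, assuming regex-based filtering: `clean_text_sketch`, the character ranges, and the three-word stopword list are illustrative assumptions, not the contents of the `Functions` module.

```python
import re

# Hypothetical stopword list for illustration only
EXAMPLE_STOPWORDS = {"في", "من", "على"}

# Arabic diacritics (tashkeel) plus the tatweel elongation character
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Anything that is not a basic Arabic letter or whitespace
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")


def clean_text_sketch(text):
    """Return the cleaned text, or None when nothing is left."""
    # Remove diacritics first so they do not split words in the next step
    text = DIACRITICS.sub("", text)
    # Replace mentions, emoji, digits, Latin letters, and punctuation with spaces
    text = NON_ARABIC.sub(" ", text)
    # split() also collapses repeated whitespace
    tokens = [t for t in text.split() if t not in EXAMPLE_STOPWORDS]
    return " ".join(tokens) if tokens else None


print(clean_text_sketch("@user 123 كتاب من البيت!!!"))
```

Returning `None` for empty results is a design choice in this sketch: it lets a later `dropna` call remove rows that contained no Arabic text.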

In [8]:
df = merged_df.copy()

In [9]:
df = df.drop(columns=['id'])

In [10]:
df['dialect'].unique()

array(['LY', 'MA', 'EG', 'LB', 'SD'], dtype=object)

In [11]:
dialect_counts = df['dialect'].value_counts()
dialect_counts

dialect
EG    57636
LY    36499
LB    27617
SD    14434
MA    11539
Name: count, dtype: int64

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 147725 entries, 0 to 147724
Data columns (total 2 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   text     147725 non-null  object
 1   dialect  147725 non-null  object
dtypes: object(2)
memory usage: 2.3+ MB


In [13]:
#split data into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['dialect'], test_size=0.2, random_state=42)
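
Because the classes are imbalanced, a purely random split can shift the dialect proportions slightly between the two sets. The sketch below, on made-up data, draws the same fraction from every dialect using pandas only; scikit-learn's `train_test_split` also accepts a `stratify` argument for the same purpose. The names `demo`, `test_part`, and `train_part` are illustrative.

```python
import pandas as pd

# Made-up imbalanced data for illustration
demo = pd.DataFrame({
    "text": [f"t{i}" for i in range(100)],
    "dialect": ["EG"] * 80 + ["MA"] * 20,
})

# Draw 20% of each dialect for the test set
test_part = demo.groupby("dialect", group_keys=False).sample(frac=0.2, random_state=42)
# Everything that was not drawn stays in the training set
train_part = demo.drop(test_part.index)

print(train_part["dialect"].value_counts())
print(test_part["dialect"].value_counts())
```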

In [14]:
from Functions import clean_text

In [15]:
#apply clean text function to train data
X_train = X_train.apply(clean_text)
#apply clean text function to test data
X_test = X_test.apply(clean_text)

In [16]:
#convert to dataframe
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)
y_train = pd.DataFrame(y_train)
y_test = pd.DataFrame(y_test)

In [17]:
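#drop null values and keep the labels aligned with the remaining rows
X_train = X_train.dropna(subset=['text'])
X_test = X_test.dropna(subset=['text'])
y_train = y_train.loc[X_train.index]
y_test = y_test.loc[X_test.index]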

In [18]:
#save cleaned dataframes
X_train.to_csv('Data/X_train.csv', index=False, encoding='utf-8-sig')
X_test.to_csv('Data/X_test.csv', index=False, encoding='utf-8-sig')
y_train.to_csv('Data/y_train.csv', index=False, encoding='utf-8-sig')
y_test.to_csv('Data/y_test.csv', index=False, encoding='utf-8-sig')

## Step 5: Upsample Data to Balance Classes
We upsample the minority classes (SD and MA) with replacement to reduce the class imbalance in the training data.

### Benefit
A more balanced training set reduces the model's bias towards the majority classes. Only the training split is upsampled, so the test set keeps its original class distribution.

In [19]:
# df_X_train = pd.read_csv('Data/X_train.csv')
# df_Y_train = pd.read_csv('Data/y_train.csv')

In [20]:
#concatenate features and labels
df = pd.concat([X_train, y_train], axis=1)

In [21]:
df.head()

Unnamed: 0,text,dialect
57485,الدنيا دي الحلو والوحش,EG
61118,انا بقيت اعمل كده علي فكره ليه اوجع ايدي تقليب...,EG
48452,البوست بتاع خافيير سولانا طلع انه هاكر تركي اخ...,EG
95030,احلي دي ايه افريقياياوداد,EG
96136,حبيبي قدها كبير والله,EG


In [22]:
#check data balance
df.dialect.value_counts()

dialect
EG    46152
LY    29231
LB    22039
SD    11502
MA     9256
Name: count, dtype: int64

In [23]:
#upsample the minority classes to reduce the imbalance
from sklearn.utils import resample

# Separate majority and minority classes
majority_classes = df[df['dialect'].isin(['EG', 'LY', 'LB'])]
minority_classes = df[~df['dialect'].isin(['EG', 'LY', 'LB'])]  

# Upsample the pooled minority classes (SD and MA) to half the combined size of the majority classes
minority_upsampled = resample(minority_classes, 
                              replace=True,     # sample with replacement
                              n_samples=int(len(majority_classes)*0.5),    
                              random_state=42) # reproducible results

# Combine minority_upsampled with majority_classes
balanced_df = pd.concat([minority_upsampled, majority_classes])

# Shuffle the dataset
balanced_df = balanced_df.sample(frac=1, random_state=42).reset_index(drop=True)

In [24]:
print(balanced_df['dialect'].value_counts())

dialect
EG    46152
LY    29231
SD    26854
LB    22039
MA    21857
Name: count, dtype: int64
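
The counts above are closer together than before but not equal, because SD and MA are resampled as one pool. If exactly equal class sizes are wanted, one option is to upsample each class separately up to the size of the largest class. This is a sketch on made-up data, not the method used in this notebook, and like the step above it should be applied to the training split only.

```python
import pandas as pd

# Made-up imbalanced training data for illustration
demo = pd.DataFrame({
    "text": [f"t{i}" for i in range(10)],
    "dialect": ["EG"] * 6 + ["SD"] * 3 + ["MA"] * 1,
})

# Every class is brought up to the size of the largest class
target = demo["dialect"].value_counts().max()

parts = []
for label, group in demo.groupby("dialect"):
    shortfall = target - len(group)
    if shortfall > 0:
        # Keep all original rows and add duplicates sampled with replacement
        extra = group.sample(n=shortfall, replace=True, random_state=42)
        group = pd.concat([group, extra])
    parts.append(group)

# Combine and shuffle
per_class_balanced = (
    pd.concat(parts).sample(frac=1, random_state=42).reset_index(drop=True)
)
print(per_class_balanced["dialect"].value_counts())
```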


In [25]:
balanced_df = balanced_df.dropna(subset=['text'])

## Step 6: Save the Balanced DataFrame
Finally, we save the balanced training data and the test data to CSV files for use in model training and evaluation.

In [26]:
#save balanced dataframe
balanced_df.to_csv('Data/Train_data.csv', index=False, encoding='utf-8-sig')

In [27]:
#combine test features and labels, dropping rows whose text is missing after cleaning
Test_data = pd.concat([X_test, y_test], axis=1).dropna(subset=['text'])

In [28]:
#save test data
Test_data.to_csv('Data/Test_data.csv', index=False, encoding='utf-8-sig')