# Feature Extraction Module

This module extracts features from text at two granularities: word level and character level.

## Word Level Features

### Bag of Words

The Bag of Words technique represents each document as a vector with one element per vocabulary word; each element holds the number of times that word occurs in the document. Word order is discarded.
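To make the counting concrete, here is a minimal pure-Python sketch of a bag-of-words matrix. The function name `bag_of_words_counts` and the whitespace tokenization are assumptions made for illustration; the `CountVectorizer` used later in this notebook applies its own tokenization rules.

```python
from collections import Counter

def bag_of_words_counts(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts vocabulary[j] in documents[i]."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is the sorted set of all tokens seen in any document.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One row per document, one column per vocabulary word.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

docs = ["the cat sat", "the cat saw the dog"]
vocab, counts = bag_of_words_counts(docs)
print(vocab)
print(counts)
```

Each row describes one document, so two documents with the same words in a different order get identical rows.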

### TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that reflects how important a word is to a document within a corpus. It increases with the frequency of the word in the document and decreases with the number of documents in the corpus that contain the word, so words that appear in almost every document receive low weights.
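As a rough illustration, the sketch below computes the textbook form of the weight, `tf * log(N / df)`, in plain Python. The helper name `tf_idf` and the whitespace tokenization are assumptions; scikit-learn's `TfidfVectorizer` defaults to a smoothed IDF and L2-normalizes each row, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        scores.append({
            # Term frequency is the count normalized by document length.
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return scores

docs = ["the cat sat", "the dog barked"]
weights = tf_idf(docs)
print(weights[0])
```

A term that occurs in every document, such as "the" here, gets weight 0 because `log(N / df)` is `log(1)`.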

### Word Embeddings

Word embeddings are dense vector representations of words that capture semantic relationships between words. They can be learned from large text corpora, as static models such as FastText do with one vector per word, or obtained from pre-trained models like BERT or ELMo, which provide contextualized embeddings that vary with the surrounding sentence.
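One common way to read "semantic relationships" off such vectors is to rank words by cosine similarity. The sketch below does this over a hand-made table; the three-dimensional vectors and the helper names are hypothetical and chosen only for illustration, not taken from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, made up for illustration only.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "apple": [0.1, 0.2, 0.95],
}

def nearest_neighbors(word, vectors, k=2):
    """Return up to k (similarity, word) pairs, most similar first."""
    query = vectors[word]
    scored = [
        (cosine_similarity(query, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, reverse=True)[:k]

neighbors = nearest_neighbors("king", toy_vectors)
print(neighbors)
```

The result is a list of `(score, word)` pairs, the same shape as the `get_nearest_neighbors` output printed further down in this notebook.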

## Character Level Features

Character level features describe the individual characters within words, taking into account the characters that precede and follow them.
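A minimal sketch of such context features is shown below: for every character it records the characters immediately before and after it. The function name `char_context_windows`, the `window` parameter, and the `_` padding symbol are assumptions introduced for this example.

```python
def char_context_windows(word, window=1, pad="_"):
    """For each character, collect `window` characters on either side."""
    # Pad both ends so the first and last characters also have a full context.
    padded = pad * window + word + pad * window
    features = []
    for i, ch in enumerate(word):
        center = i + window  # position of ch inside the padded string
        features.append({
            "char": ch,
            "prev": padded[center - window:center],
            "next": padded[center + 1:center + 1 + window],
        })
    return features

windows = char_context_windows("cat")
for feature in windows:
    print(feature)
```

Increasing `window` widens the context, at the cost of more padding at word boundaries.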

### Character Embeddings

Character embeddings are dense vector representations of characters that capture the relationships between characters. They can be learned from the data or obtained from pre-trained models.
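Mechanically, a character embedding is a lookup table with one dense row per character. The sketch below builds such a table with random initial values, which is roughly where training would start; the helper names, the dimension of 4, and the initialization range are all assumptions for illustration, and no training step is shown.

```python
import random

def build_char_embeddings(text, dim=4, seed=0):
    """Create a character index and a randomly initialized embedding table."""
    rng = random.Random(seed)
    chars = sorted(set(text))
    char_to_index = {ch: i for i, ch in enumerate(chars)}
    # One row of `dim` small random values per distinct character.
    table = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in chars]
    return char_to_index, table

def embed(word, char_to_index, table):
    """Look up the embedding row of every character in `word`."""
    return [table[char_to_index[ch]] for ch in word]

char_to_index, table = build_char_embeddings("banana band")
vectors = embed("ban", char_to_index, table)
print(len(vectors), len(vectors[0]))
```

In a learned model these rows would be updated during training so that characters used in similar contexts end up with similar vectors.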

### One-Hot Encoding

One-Hot Encoding represents each character as a binary vector with one element per character in the alphabet: the element for that character is 1 and all others are 0. It treats characters as unordered categories and encodes no similarity between them.
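The encoding can be sketched in a few lines of plain Python. The function name `one_hot_encode` and the choice to derive the alphabet from the input when none is given are assumptions for this example.

```python
def one_hot_encode(text, alphabet=None):
    """Return (alphabet, vectors) with one binary vector per character of `text`."""
    if alphabet is None:
        # Assumption: fall back to the sorted set of characters in the input.
        alphabet = sorted(set(text))
    index = {ch: i for i, ch in enumerate(alphabet)}
    vectors = []
    for ch in text:
        row = [0] * len(alphabet)
        row[index[ch]] = 1  # exactly one position is set per character
        vectors.append(row)
    return alphabet, vectors

alphabet, encoded = one_hot_encode("abca")
print(alphabet)
print(encoded)
```

Passing a fixed `alphabet` keeps the vector layout stable across inputs, which matters when the same encoding is reused for training and inference.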


In [7]:
#imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# %pip install seaborn
import seaborn as sns
import os
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

### Word level features

In [16]:
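def bag_of_words(df, col):
    # Fit a bag-of-words vocabulary on the column and count term occurrences per row
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(df[col])
    # Label the columns with the vocabulary terms so the features stay interpretable
    # (get_feature_names_out requires scikit-learn >= 1.0)
    bow_df = pd.DataFrame(counts.toarray(), columns=vectorizer.get_feature_names_out())
    return bow_df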

In [18]:
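def tfidf(df, col):
    # Fit a TF-IDF vocabulary on the column and compute the weights per row
    # (the vectorizer gets its own name so it does not shadow this function)
    vectorizer = TfidfVectorizer()
    weights = vectorizer.fit_transform(df[col])
    # Label the columns with the vocabulary terms so the features stay interpretable
    # (get_feature_names_out requires scikit-learn >= 1.0)
    tfidf_df = pd.DataFrame(weights.toarray(), columns=vectorizer.get_feature_names_out())
    return tfidf_df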


In [10]:
# %pip install fasttext
import fasttext
def create_arabic_word_embedding(text_file, model_file):
    # Train an unsupervised FastText skip-gram model on the Arabic text file
    model = fasttext.train_unsupervised(text_file, model='skipgram')
    # Save the trained model to disk so it can be reloaded later
    model.save_model(model_file)
    return model

# Test the model by printing the nearest neighbors of a few sample words
def test_model(model_file):
    model = fasttext.load_model(model_file)
    for word in ['الله', 'الملك', 'الملكة', 'الملكي', 'الملكية']:
        print(model.get_nearest_neighbors(word))

text_file = 'processed_output.txt'
model_file = 'model.bin'
create_arabic_word_embedding(text_file, model_file)
test_model(model_file)





[(0.928964376449585, 'اللهِ'), (0.8942822813987732, 'عليه'), (0.8932053446769714, 'صلى'), (0.88192218542099, 'اللهُ'), (0.8680329322814941, 'وسلم'), (0.7963157296180725, 'الأَنْصَارِىِّ'), (0.7874833345413208, 'اللَّه'), (0.787224531173706, 'النّاسُ'), (0.7832015156745911, 'النُّعْمَانِ'), (0.7828898429870605, 'النُّعْمَانُ')]
[(0.7418180704116821, 'النُّفُوسِ'), (0.7313854694366455, 'اللهُ'), (0.7193009853363037, 'السِّدْرِ'), (0.7180575728416443, 'التَّشَبُّهِ'), (0.7086499333381653, 'الله'), (0.7077236175537109, 'الْقُمَاشِ'), (0.7072348594665527, 'الشَّكْوَى'), (0.7053272724151611, 'الْكِبْرَ'), (0.7052369117736816, 'النُّفُوذِ'), (0.7025838494300842, 'الصِّبْيَانَ')]
[(0.7427960634231567, 'النُّفُوسِ'), (0.7381163835525513, 'اللهُ'), (0.7364920377731323, 'السِّدْرِ'), (0.7363826632499695, 'التَّشَبُّهِ'), (0.7237018346786499, 'الطِّيبَ'), (0.7141523957252502, 'الدَّلْوِ'), (0.7117125988006592, 'الله'), (0.7079038619995117, 'الشَّكْوَى'), (0.7070366144180298, 'الْقُمَاشِ'), (0.7068

In [11]:
# More tests: retrain on a second corpus (this overwrites model.bin from the previous cell)
text_file = 'words.txt'
model_file = 'model.bin'
create_arabic_word_embedding(text_file, model_file)
model = fasttext.load_model(model_file)
print(model.get_nearest_neighbors('الصلاة'))
print(model.get_nearest_neighbors('الصيام'))
print(model.get_nearest_neighbors('الزكاة'))
print(model.get_nearest_neighbors('الحج'))
print(model.get_nearest_neighbors('الإسلام'))
print(model.get_nearest_neighbors('الإيمان'))
print(model.get_nearest_neighbors('الإحسان'))
print(model.get_nearest_neighbors('السلام'))
print(model.get_nearest_neighbors('المسلم'))
print(model.get_nearest_neighbors('يجوز'))
print(model.get_nearest_neighbors('يجب'))
print(model.get_nearest_neighbors('يحرم'))



[(0.8729323148727417, 'كالصلاة'), (0.8679752945899963, 'لصلاة'), (0.837247908115387, 'والصلاة'), (0.8103158473968506, 'وصلاة'), (0.807507336139679, 'بصلاة'), (0.8005672693252563, 'صلاة'), (0.7991220355033875, 'للصلاة'), (0.7942743897438049, 'كصلاة'), (0.7561188340187073, 'فصلاته'), (0.7496492862701416, 'بالصلاة')]
[(0.8781991600990295, 'صيام'), (0.8510888814926147, 'كصيام'), (0.8345058560371399, 'فصيام'), (0.8330397009849548, 'بصيام'), (0.8206717371940613, 'بالصيام'), (0.7958154678344727, 'وصيام'), (0.7822644114494324, 'صيامه'), (0.7596244812011719, 'وأيام'), (0.7555518746376038, 'الصوم'), (0.7435011267662048, 'الإطعام')]
[(0.9087247848510742, 'كالزكاة'), (0.8874373435974121, 'للزكاة'), (0.8786138892173767, 'والزكاة'), (0.8772928714752197, 'زكاة'), (0.8646826148033142, 'كزكاة'), (0.8360320925712585, 'وزكاة'), (0.7191078662872314, 'زكاته'), (0.6740335822105408, 'زكاتها'), (0.6733525395393372, 'الفطرة'), (0.6722766160964966, 'الغنى')]
[(0.8515108823776245, 'العمرة'), (0.8395167589187622,

In [30]:
# Implement a function to extract contextual embeddings of words



ImportError: 
BertModel requires the PyTorch library but it was not found in your environment. Checkout the instructions on the
installation page: https://pytorch.org/get-started/locally/ and follow the ones that match your environment.
Please note that you may need to restart your runtime after installation.


## Character level features