# `TFIDF` Eras

The goal is to see how key words and phrases have changed and evolved over time. Are the schools stagnant in the way they address law? Are there new methods or viewpoints? How do the individual schools differ, and do they evolve in the same way?

In [1]:
import io
import json
from pyarabic import araby
from nltk.corpus import stopwords
from stop_words import get_stop_words
from nltk.stem.isri import ISRIStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

## Splitting Up Eras
To split up books by era, the dict `time_ranges` defines how many books are in each of the 6 eras. Initially, I used 4 eras, but after a few trials and more careful consideration, I decided to go with 6.

In general, this is how the time frames are split up:

| Era |  Time Range |          Description           |
|:---:|:-----------:|:------------------------------:|
|  0  | 0000 - 0200 |      Initial Codification      |
|  1  | 0200 - 0500 |         Schools Spread         |
|  2  | 0500 - 0700 |       Major Global Works       |
|  3  | 0700 - 0900 |          Commentaries          |
|  4  | 0900 - 1250 | Industrial Revolution & Europe |
|  5  | 1250 - 1450 |          Modern Times          |

In [2]:
categories = ['134','135','136','137']

time_ranges = {
    134: [7, 6, 10, 10, 10, 5],
    135: [2, 18, 12, 8, 17, 9],
    136: [2, 8, 15, 10, 16, 9],     # [1, 9, 15, 10, 16, 9]   Used time w/ al-Muzanī because of min_df in TFIDF 
    137: [6, 10, 12, 12, 20, 17]
}
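The per-era counts only make sense as slice boundaries if each school's book list is in chronological order, which the slicing further down takes for granted. As a minimal sketch (the helper name `era_slices` is hypothetical, not part of the notebook), the counts can be turned into `(start, end)` index pairs like this:

```python
from itertools import accumulate


def era_slices(counts):
    """Turn per-era book counts into (start, end) index pairs for slicing."""
    ends = list(accumulate(counts))   # running total = exclusive end index
    starts = [0] + ends[:-1]          # each era starts where the previous ended
    return list(zip(starts, ends))


# Counts for category 134, copied from time_ranges above.
for era, (start, end) in enumerate(era_slices([7, 6, 10, 10, 10, 5])):
    print('Era {}: books[{}:{}]'.format(era, start, end))
```

Each pair can be used directly as `books[start:end]`, and the last end index should equal the number of books in that school.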

## `texts_13*.json` vs `texts_13*_stemmed.json`
I compared `word frequencies` for the same texts with and without NLTK stemming and found that NLTK has many issues when it comes to stemming these texts properly, which had a large negative effect on the `word frequencies`. I therefore dropped stemming entirely and use the unstemmed `texts_13*.json` files.

In [3]:
all_books = []
for category_id in categories:
    with open('gensim_files/texts_'+str(category_id)+'.json') as f:
        books = json.load(f)
    all_books.append(books)

In [4]:
sw1 = get_stop_words('arabic') + stopwords.words("arabic")
sw2 = ['ا','أ','إ','ذ','ض','ص','ث','ق','ف','غ','ع','ه','خ','ح','ج','ش','س','ي','ب','ل','ا','ال','ت','ن','م','ك','ئ','ء','ؤ','ر','لا','ى','ة','و','ز','ظ']
sw3 = ["آله", "أبو", "أبي", "أثنى", "أحدهما", "أظهرهما", "أن", "أنه", "أو", "أى", "إلا", "إلخ", "إله", "إلى", "ابن", "ابو","ابي", "الآتي", "الأستاذ", "الأولباب", "الامام", "البصير", "الثانية", "الجلال", "الحمد", "الخ", "الدكتور", "الرحمن", "الرحيم", "الرسول", "السميع","الشارح", "الشيء", "الشيخ", "الصمد", "العبد", "العلامة", "العلي", "الفقير", "القدير", "الكتاب", "الله", "المؤلف", "المؤلفة", "المجلد","المسألة", "المصنف", "النبي", "الي", "انتهى", "انظر", "اهـ", "باب", "بالواو", "بدلالة", "برقم", "بسم", "بعد", "بقيد", "بمثابة", "بن", "به","بها", "بين", "بينهما", "تأمل", "تخريجه", "تعالى", "تعبيره", "ثم", "ثناء", "حتى", "حكاهما", "حمدا", "خصما", "دليلنا", "ذكرنا", "ذكرناه", "ر"]
sw4 = ['لأنهما','يحصل','قولهما','بدون','','وأنه','وروى','المقصود','أصلا','لوجود','الشرع','فقط','ولعل','اختلافهم','فقولان','فصلوأما','اليه','قدمنا','ثالثها','معنى','تقدم','والله','أنها','بلا','وأن','بفاس','منهم','أعلم','ففيه','وهل','أنها','وأن','ذكره','كلامه','قاله','نقله','منهما','بأنه','بنحو','ومحل','نقل','وجهين','فعلى','كون','وأن','أحد','بلا','','','','','','لقول','أنها','أخذ','ففي','ذكره','فاذا','ويدل','قيل','قالوا','القول','وجه','المعنى','وجهين','باعتبار','اعتبار','بينا','بأس','فلذلك','فلهذا','وقاله','تأويلان','القولين','بقوله','إليه','بذلك','شيئا','عنده','وذاك','لعدم','ومنهم','قولين','عبارة','زيادتي','وينبغي','ولهذا','أكان','وخبر','وحينئذ','رحمهم','فهنا','إليه','فلم','غيره','أيضا','ولسنا','جميعهم','وليسوا','الأوجه','التالية','وثالثها','قلت','وكذلك','وسلم','وقال','شيء','لأن','لأن','فهو','فقال','لأنه','رحمه','فلما','يكن','وابن','رسول','النبي','وقيل','وكذا','وإلا','ونحوه','واحد','فلو','الأول','بأن','والثاني','وجهان','قلنا','الله',"فقال","وعن","ربه", "رحمة", "رسول", "رضى", "رضي", "رقم", "رها", "سبحانه", "سنن", "سيأتي", "شرح", "شيخ", "صلى", "طبقات", "عبد", "على", "عليكم", "عليه", "عن","عند", "عنه", "غير", "فأشبه", "فأما", "فإن", "فإنا", "فإنه", "فائدة", "فافهم", "فالأصح", "فالقاضي", "فالوجه", "فان", "فانه", "فجاز", "فدل", "فصل","فكان", "فلأن", "فلأنه", "فلا", "فلما", "فليتأمل", "فمنهم", "فنقول", "فهذا", "فهل", "فوجهان", "فى", "في", "فيه", "فيها", "قال", "قبل", "قدمناه","قلنا", "قول", "قولان", "قوله", "كان", "كتاب", "كلام", "كلامهم", "كما", "كونه", "لأنا", "لأنه", "لأنها", "لان", "لانه", "لخبر", "لذلك", "لرحمة","لقوله", "له", "لهم", "لو", "مادة", "مثال", "مثلا", "عبدا", "مع", "معطوف", "مقدمة", "من", "مناهج", "منتهى", "منه", "نسخة", "نسلم", "نصا","نصه", "نقول", "هريرة", "ههنا", "وأصحهما", "وأما", "وأيضا", "وإن", "وإنما", "وإنه", "واحتج", "واعلم", "والتقوى", "والثانى", "والثاني","والسلام", "والصلاة", "وان", "وانظر", "وبالله", "وبه", "وتقدم", "وجزم", "وجل", "وجهان", "وسلم", "وسن", "وشرعا", "وصلى", "وعبارة", "وعلى","وعنه", "وغيره", "وغيرهم", "وفى", "وقال", "وقد", 
"وقدمه", "وقوله", "وقولي", "وقيل", "وكذلك", "ولأن", "ولأنه", "ولأنها", "ولذا", "ولكنا", "ولنا", "ولو","عنهم", "وهذا", "وهنا",'فذا','فهذ','هتحقيق','فاستخلفه','واعتبارا','وبالجملة','اليها','وذا',"وهو", "ويحتمل", "يعني", "يقال", "يقول", "يكون"]
sw = set(sw1+sw2+sw3+sw4)
    
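def dummy(text):
    # Identity function: the books are already tokenized, so this is passed to
    # TfidfVectorizer as both tokenizer and preprocessor to skip those steps.
    return text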

# `TFIDF Vectorizer`
Now, it's important to know what `TfidfVectorizer` does. It weights each term by how often it occurs within a document and discounts terms that occur in many documents.

Term frequency (tf) is the frequency of a certain term in a document:

$$
\mathrm{tf}(t,d) = \frac{N_\text{term}}{N_\text{terms in Document}}
$$
where

- $N_\text{term}$ is the number of times a term/word $t$ appears in document $d$
- $N_\text{terms in Document}$ is the number of terms/words in document $d$

Inverse document frequency (idf) is the logarithm of the total number of documents in the corpus divided by the number of documents that contain the term:

$$
\mathrm{idf}(t, D) = \log\frac{N_\text{Documents}}{N_\text{Documents that contain term}}
$$
where

- $N_\text{Documents}$ is the number of documents in the corpus $D$
- $N_\text{Documents that contain term}$ is the number of documents in $D$ that contain term/word $t$

TF-IDF is then calculated as:

$$
\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)
$$

The result penalizes common words and gives rare words more weight. This makes it a more robust and methodical vectorizer than Count or Hash for this purpose, which is why I didn't bother with the others. It helps the characteristic words of each school and era stand out with much more weight.
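
To make the formulas concrete, here is a small worked example in plain Python. The function names and the transliterated toy tokens are made up for illustration. It follows the textbook definitions above, not scikit-learn's exact arithmetic: with its default settings `TfidfVectorizer` uses raw counts for tf, a smoothed idf, and L2 normalization of each document vector, so its numbers will differ.

```python
import math


def tf(term, doc):
    """Term frequency: occurrences of term divided by the document length."""
    return doc.count(term) / len(doc)


def idf(term, corpus):
    """Inverse document frequency: log(N documents / N documents with term).

    Raises ZeroDivisionError if the term occurs in no document.
    """
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)


def tfidf_score(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


# Hypothetical, already tokenized toy documents.
corpus = [
    ["salah", "zakah", "salah"],
    ["salah", "bay"],
    ["salah", "nikah", "bay"],
]

# "salah" occurs in every document, so idf = log(3/3) = 0 and its weight vanishes.
print(tfidf_score("salah", corpus[0], corpus))
# "zakah" occurs in one document only: tf = 1/3, idf = log(3).
print(tfidf_score("zakah", corpus[0], corpus))
```

A term that appears in every document scores zero no matter how often it is repeated, which is the penalty on common words described above.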

## `Tuning`
- Since 2 schools have only `2` books in their earliest era, `min_df` had to be set to `1`.
- Since many texts are filled with many concepts of law, quoting previous scholars, there is **a lot** of repetition. Thus `max_df` is set to `.85`, so terms that appear in more than 85% of the books in a slice are dropped.
- Since the books have been `pre-processed` in [00_create_texts_dicts_corpora.py](00_create_texts_dicts_corpora.py), there is no need to `tokenize` again. However, if any common `stop_words` are still present (due to issues in the text), they will be removed again.
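
The interaction between `min_df`, `max_df`, and very small slices can be sketched in plain Python. This is a simplified illustration of the document-frequency filter as I understand it from the scikit-learn documentation (an integer is an absolute document count, a float is a proportion of the documents), not the library's own code; `df_bounds` and `kept_terms` are hypothetical names.

```python
def df_bounds(n_docs, min_df=1, max_df=0.85):
    """Lowest and highest number of documents a term may appear in."""
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return low, high


def kept_terms(docs, min_df=1, max_df=0.85):
    """Terms whose document frequency lies within [low, high]."""
    low, high = df_bounds(len(docs), min_df, max_df)
    if high < low:
        raise ValueError("max_df allows fewer documents than min_df requires")
    doc_freq = {}
    for doc in docs:
        for term in set(doc):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term for term, n in doc_freq.items() if low <= n <= high}


# A slice with only two books: the upper bound is 0.85 * 2 = 1.7 documents,
# so any term shared by both books is filtered out.
two_books = [["a", "b"], ["a", "c"]]
print(sorted(kept_terms(two_books)))
```

Under these assumptions, `min_df=2` on a two-book slice leaves no valid range at all (the lower bound 2 exceeds the upper bound 1.7), which is consistent with having to set `min_df` to `1`. It also means that in two-book eras only terms unique to one of the books survive.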

In [5]:
tfidf = TfidfVectorizer(
        min_df=1,
        max_df=.85,
        stop_words=list(sw),  # pass a list; recent scikit-learn versions reject a set
        analyzer='word',
        tokenizer=dummy,
        preprocessor=dummy,
        token_pattern=None)

In [None]:
all_tfidfs = []
word_freqs = []
for books,category_id in zip(all_books,categories):
    print('###############\nFrom Category {}\n###############'.format(category_id))
    start_time = 0
    end_time = 0
    time_range = time_ranges[int(category_id)]
    school_tfidfs = []
    
    for time in range(0,len(time_range)):
        end_time += time_range[time]
        print('TFIDF Books {} - {}.'.format(start_time,end_time))
        
        bow = tfidf.fit_transform(books[start_time:end_time])
        all_tfidfs.append(bow)
        sum_words = bow.sum(axis=0)
        words_freq = [(word, sum_words[0, idx]) for word, idx in tfidf.vocabulary_.items()]
        words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)
        school_tfidfs.append(words_freq)
        start_time += time_range[time]
    word_freqs.append(school_tfidfs)
    
with open('../../data/all_word_freqs_85.json', 'w') as f:
    json.dump(word_freqs, f)

###############
From Category 134
###############
TFIDF Books 0 - 7.
TFIDF Books 7 - 13.
TFIDF Books 13 - 23.
TFIDF Books 23 - 33.
TFIDF Books 33 - 43.
TFIDF Books 43 - 48.
###############
From Category 135
###############
TFIDF Books 0 - 2.
TFIDF Books 2 - 20.
TFIDF Books 20 - 32.
TFIDF Books 32 - 40.
TFIDF Books 40 - 57.
TFIDF Books 57 - 66.
###############
From Category 136
###############
TFIDF Books 0 - 2.
TFIDF Books 2 - 10.
TFIDF Books 10 - 25.
TFIDF Books 25 - 35.
TFIDF Books 35 - 51.
TFIDF Books 51 - 60.
###############
From Category 137
###############
TFIDF Books 0 - 6.
TFIDF Books 6 - 16.
TFIDF Books 16 - 28.
TFIDF Books 28 - 40.
TFIDF Books 40 - 60.
TFIDF Books 60 - 77.
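
Once the scores are saved, one way to approach the original question (how key terms change between eras) is to compare the top-ranked terms of consecutive eras. The sketch below assumes the nesting produced by the loop above, `word_freqs[school][era]`, with each era a list of `[word, score]` pairs sorted by descending score. The helper names and the toy data are hypothetical.

```python
def top_terms(word_freqs, school_idx, era_idx, n=10):
    """The n highest-scoring words for one school and era."""
    return [word for word, _score in word_freqs[school_idx][era_idx][:n]]


def new_terms(word_freqs, school_idx, era_idx, n=10):
    """Top-n words of an era that were not in the previous era's top-n."""
    if era_idx == 0:
        return top_terms(word_freqs, school_idx, era_idx, n)
    previous = set(top_terms(word_freqs, school_idx, era_idx - 1, n))
    current = top_terms(word_freqs, school_idx, era_idx, n)
    return [word for word in current if word not in previous]


# Toy stand-in for the saved structure: one school, two eras.
toy_word_freqs = [[
    [["salah", 3.1], ["zakah", 2.0], ["bay", 1.2]],
    [["salah", 2.9], ["waqf", 2.5], ["bay", 1.0]],
]]
print(new_terms(toy_word_freqs, 0, 1, n=2))
```

Because each era is vectorized separately, scores are not comparable across eras, so comparing ranks or set membership is probably safer than comparing raw values.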
