# USING FRIDAY SERMONS AS DATA

**Ihsan Kahveci**

## BACKGROUND AND CONTEXT 

A Friday sermon (hutbe, also transliterated khutbah) is a one- to two-page text that includes verses from the Quran and Hadith plus a weekly agenda or information for the Muslim community. These agendas range from daily practices such as personal hygiene to topical issues such as Black Friday. Hutbe texts must be read aloud by imams during the Friday prayer **in all of the 88,000 mosques in Turkey, and imams are required by law to keep to the text.**


The Presidency of Religious Affairs (Diyanet Isleri Baskanligi, a.k.a. Diyanet) is a state institution founded on March 3, 1924, after the Kemalist secular revolution in Turkey. Its aim was to prevent Arabic influence on the Turkish Muslim majority by nationalizing Islamic doctrine and practices. It is one of the largest government institutions, with a budget of almost $2 billion. **One of Diyanet's main objectives is to prepare and distribute Friday sermons (hutbe) to the imams of all the mosques in Turkey.**

A 2014 DİB study with a sample of 21,632 individuals, randomly selected based on census data, found that **56% of the male population reported regularly attending Friday prayers (DİB, 2014)**. This share rose to 65% in rural areas. In addition, 15% of the population stated that they attend Friday prayers on an irregular basis.

Other Islamic institutions across the world are also responsible for preparing and distributing Friday sermons, such as the Ahmadiyya Muslim Community in the United Kingdom and the Islamic Society of Boston Cultural Center (ISBCC) in the US; both are long-established NGOs closely affiliated with the Muslim community in their regions. There are also other state institutions, such as the General Authority of Islamic Affairs & Endowments in the United Arab Emirates and Alharamain in Saudi Arabia.

## OBJECTIVES: 

### The broader aim of my project: 
I plan to conduct a comparative content analysis of the hutbe texts produced by the organizations above, and hopefully more. A sermon archive must meet two requirements to be included in the dataset: the sermons must have been produced weekly over a 10-year period, and an English version or a long English summary must be available.

The comparison will be organized around the questions below: 
- Does the weekly content respond to key political events regarding Muslims in the region?
- Do the institutions differ in how they respond to worldwide events like 9/11 or Charlie Hebdo?
- Is there a trend toward or away from radicalization?


### Hypotheses: 

- **H1: There will be a distinctive topic that represents nationalism.**

- **H2: The prevalence of the nationalism topic will increase over time.**

### The goals for this course:

- Use LDA topic modeling to identify distinctive topics.
- Plot graphs to understand change over time.
- Test my hypotheses.
- Write the first draft of the methodology chapter.

In [1]:
import pandas as pd
import numpy as np
import gensim
from gensim import corpora, models
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
# WordNetLemmatizer needs the WordNet corpus; download it once with nltk.download('wordnet')
import matplotlib.pyplot as plt

np.random.seed(5757)


## Data Preparation

This is my sample dataset that I scraped from Diyanet's website:

In [2]:
documents = pd.read_csv('sermons_df', sep=';', index_col=0)
documents

Unnamed: 0,date,sermon
0,2019-02-01,\nDATE: 01.02.2019\n [pic]RIFQ (GENTLENE...
1,2018-03-02,\nLOCATION : NATIONWIDE\nDATE : 02.03.2018...
2,2019-12-27,\nDATE: 27.12.2019\n ...
3,2017-08-25,LOCATION\t: NATIONWIDE\n\n\tDATE\t: 25.08.2017...
4,2019-10-25,\nDATE: 25.10.2019\n[pic]\n\n FOR...
...,...,...
152,2019-04-12,\nDATE: 12.04.2019\n ...
153,2017-09-01,\nLOCATION : NATIONWIDE\nDATE : 01.09.2017...
154,2017-02-03,\n LOCATION : NATIONWIDE\n DATE : 03....
155,2019-08-23,\nDATE: 23.08.2019\n[pic]\n SOCIAL HARMS OF...


In [3]:
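stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()
# Header words that appear in the sermon headers and carry no topical meaning
custom_words = {'date', 'location', 'nationwide'}

def lemmatize_stemming(text):
    """Lemmatize a token as a verb, then reduce it to its Snowball stem."""
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def preprocess(text):
    """Tokenize a sermon, drop stopwords, short tokens and header words, then stem."""
    result = []
    for token in simple_preprocess(text):
        # Keep tokens longer than 3 characters that are neither stopwords nor header words
        if token not in STOPWORDS and len(token) > 3 and token not in custom_words:
            result.append(lemmatize_stemming(token))
    return result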

In [4]:
doc_sample = documents.sermon[57]

In [5]:
print("\033[1m" + "Original 57th Sermon: " + "\033[0;0m")
print(' ')
print(doc_sample)

Original 57th Sermon: 
 

LOCATION  : NATIONWIDE
DATE             : 13.01.2017

                                                                       [pic]

                   TO BE ABLE TO BE A SERVANT ALLAH LOVES

    A Blessed Friday to You, Brothers and Sisters!
    In the verse I have recited, Allah Almighty  shows  us  the  way  to  be
honored with His love and favor: "Say, [O Muhammad],  'If  you  should  love
Allah, then follow me, so Allah will love you and  forgive  you  your  sins.
And Allah is Forgiving and Merciful.'"[i]
    Brothers and Sisters!
    The Almighty stated the deeds that will move us away from His  grace  in
His Book like He showed us in it the ones that will bring us to  His  favor.
He stated the ones honored with His love as well as those left  without  His
love and mercy. Come, let's hear in today's khutbah  who  the  servants  our
Lord loves are.
    ?????????  ???????  ????????????????  Allah  loves  those  pure  in  the
spiritual and material

In [6]:
print("\033[1m" + "Tokenized and Lemmatized 57th Sermon: " + "\033[0;0m")
print(' ')
print(preprocess(doc_sample))

Tokenized and Lemmatized 57th Sermon: 
 
['abl', 'servant', 'allah', 'love', 'bless', 'friday', 'brother', 'sister', 'vers', 'recit', 'allah', 'almighti', 'show', 'honor', 'love', 'favor', 'muhammad', 'love', 'allah', 'follow', 'allah', 'love', 'forgiv', 'sin', 'allah', 'forgiv', 'merci', 'brother', 'sister', 'almighti', 'state', 'deed', 'away', 'grace', 'book', 'like', 'show', 'one', 'bring', 'favor', 'state', 'one', 'honor', 'love', 'leav', 'love', 'merci', 'come', 'hear', 'today', 'khutbah', 'servant', 'lord', 'love', 'allah', 'love', 'pure', 'spiritu', 'materi', 'sens', 'come', 'protect', 'disposit', 'evil', 'mind', 'heart', 'center', 'good', 'nice', 'thing', 'evil', 'ugli', 'allah', 'love', 'repent', 'honor', 'brother', 'sister', 'come', 'servant', 'repent', 'voic', 'submiss', 'lord', 'remors', 'sin', 'refug', 'vast', 'merci', 'forget', 'repent', 'like', 'start', 'life', 'allah', 'love', 'avoid', 'disobey', 'come', 'awar', 'duti', 'oblig', 'live', 'live', 'accord', 'purp

In [7]:
processed_docs = documents['sermon'].map(preprocess)
processed_docs[57]

['abl',
 'servant',
 'allah',
 'love',
 'bless',
 'friday',
 'brother',
 'sister',
 'vers',
 'recit',
 'allah',
 'almighti',
 'show',
 'honor',
 'love',
 'favor',
 'muhammad',
 'love',
 'allah',
 'follow',
 'allah',
 'love',
 'forgiv',
 'sin',
 'allah',
 'forgiv',
 'merci',
 'brother',
 'sister',
 'almighti',
 'state',
 'deed',
 'away',
 'grace',
 'book',
 'like',
 'show',
 'one',
 'bring',
 'favor',
 'state',
 'one',
 'honor',
 'love',
 'leav',
 'love',
 'merci',
 'come',
 'hear',
 'today',
 'khutbah',
 'servant',
 'lord',
 'love',
 'allah',
 'love',
 'pure',
 'spiritu',
 'materi',
 'sens',
 'come',
 'protect',
 'disposit',
 'evil',
 'mind',
 'heart',
 'center',
 'good',
 'nice',
 'thing',
 'evil',
 'ugli',
 'allah',
 'love',
 'repent',
 'honor',
 'brother',
 'sister',
 'come',
 'servant',
 'repent',
 'voic',
 'submiss',
 'lord',
 'remors',
 'sin',
 'refug',
 'vast',
 'merci',
 'forget',
 'repent',
 'like',
 'start',
 'life',
 'allah',
 'love',
 'avoid',
 'disobey',
 'come',
 'awar',


In [8]:
dictionary = gensim.corpora.Dictionary(processed_docs)

In [9]:
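# Drop tokens that occur in fewer than 5 sermons or in more than 50% of them,
# then keep at most the 100,000 most frequent remaining tokens
dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)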

In [10]:
# Collect the ids that survive filtering and print each (id, token) pair
tokens = []
for k, v in dictionary.items():
    tokens.append(k)
    print(k, v)



0 abandon
1 adab
2 advis
3 affair
4 anger
5 approach
6 attack
7 attempt
8 avoid
9 away
10 basi
11 be
12 bear
13 behav
14 behavior
15 bestow
16 better
17 birr
18 break
19 bukhari
20 calm
21 cast
22 caus
23 characterist
24 children
25 command
26 commit
27 compass
28 compassion
29 conduct
30 conscienc
31 consider
32 construct
33 consult
34 curs
35 damag
36 dawud
37 deed
38 depriv
39 devot
40 difficulti
41 enabl
42 enjoin
43 enmiti
44 equal
45 etern
46 evil
47 exampl
48 excit
49 exist
50 express
51 extrem
52 face
53 fall
54 fear
55 forgiv
56 friend
57 frustrat
58 fussilat
59 geographi
60 get
61 grace
62 gracious
63 have
64 head
65 high
66 hold
67 homeland
68 humbl
69 hurt
70 ignor
71 imran
72 increas
73 instead
74 insult
75 judgment
76 kind
77 languag
78 loyalti
79 manner
80 matter
81 mind
82 mistak
83 nation
84 need
85 neglect
86 neighbor
87 news
88 nobl
89 oppress
90 paradis
91 parent
92 patient
93 pieti
94 presenc
95 punish
96 read
97 refrain
98 regardless
99 relat
100 religion
101 repr

961 wave
962 asset
963 burden
964 complet
965 confid
966 countless
967 devast
968 disrespect
969 farewel
970 feet
971 hard
972 involv
973 lazi
974 multipli
975 owner
976 particular
977 permit
978 price
979 rich
980 satan
981 sermon
982 uncl
983 ungrat
984 usuri
985 wick
986 eat
987 healthi
988 indispens
989 introduc
990 listen
991 mental
992 set
993 suit
994 wait
995 anfal
996 astray
997 benefici
998 concess
999 convey
1000 goal
1001 lifestyl
1002 main
1003 pilgrimag
1004 pure
1005 real
1006 reflect
1007 select
1008 stanc
1009 stray
1010 wit
1011 birth
1012 conflict
1013 cultur
1014 distanc
1015 especi
1016 find
1017 hardship
1018 jugular
1019 livabl
1020 meaning
1021 nourish
1022 persecut
1023 persist
1024 primarili
1025 refresh
1026 reput
1027 sad
1028 spoil
1029 task
1030 term
1031 veriti
1032 vital
1033 abid
1034 defam
1035 difficult
1036 disturb
1037 divis
1038 forc
1039 hatr
1040 hujurat
1041 idah
1042 improv
1043 just
1044 minbar
1045 mischief
1046 offens
1047 quran
1048 second


In [11]:
len(tokens)

1323

In [12]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[57]

[(3, 1),
 (8, 3),
 (9, 2),
 (15, 2),
 (32, 1),
 (37, 1),
 (46, 2),
 (55, 2),
 (61, 2),
 (71, 1),
 (81, 1),
 (83, 2),
 (89, 4),
 (92, 2),
 (107, 1),
 (115, 1),
 (121, 1),
 (130, 1),
 (131, 1),
 (140, 3),
 (161, 1),
 (166, 2),
 (167, 1),
 (203, 1),
 (233, 4),
 (234, 1),
 (240, 1),
 (249, 1),
 (254, 2),
 (265, 1),
 (276, 1),
 (304, 2),
 (306, 1),
 (309, 2),
 (316, 1),
 (324, 1),
 (334, 2),
 (350, 1),
 (360, 3),
 (366, 1),
 (368, 1),
 (369, 3),
 (374, 1),
 (393, 1),
 (405, 2),
 (412, 1),
 (417, 1),
 (424, 1),
 (431, 1),
 (435, 1),
 (450, 1),
 (453, 3),
 (469, 1),
 (475, 1),
 (489, 4),
 (496, 2),
 (508, 2),
 (517, 1),
 (553, 1),
 (559, 2),
 (560, 1),
 (568, 1),
 (579, 1),
 (588, 2),
 (598, 1),
 (603, 2),
 (606, 1),
 (611, 1),
 (653, 1),
 (660, 1),
 (678, 1),
 (692, 1),
 (716, 1),
 (739, 1),
 (746, 1),
 (751, 1),
 (752, 1),
 (766, 2),
 (769, 1),
 (782, 1),
 (785, 1),
 (794, 1),
 (808, 4),
 (820, 1),
 (833, 1),
 (861, 1),
 (882, 1),
 (884, 1),
 (891, 1),
 (903, 1),
 (908, 1),
 (959, 1),
 (966

### Bag of Words
In the end, we created a bag-of-words corpus with one entry per sermon. Going back to our sample sermon, the 57th, we can read its bag-of-words vector as below:

In [13]:
bow_doc_57 = bow_corpus[57]

for token_id, count in bow_doc_57:
    print("Word {} (\"{}\") appears {} time.".format(token_id, dictionary[token_id], count))

Word 3 ("affair") appears 1 time.
Word 8 ("avoid") appears 3 time.
Word 9 ("away") appears 2 time.
Word 15 ("bestow") appears 2 time.
Word 32 ("construct") appears 1 time.
Word 37 ("deed") appears 1 time.
Word 46 ("evil") appears 2 time.
Word 55 ("forgiv") appears 2 time.
Word 61 ("grace") appears 2 time.
Word 71 ("imran") appears 1 time.
Word 81 ("mind") appears 1 time.
Word 83 ("nation") appears 2 time.
Word 89 ("oppress") appears 4 time.
Word 92 ("patient") appears 2 time.
Word 107 ("seek") appears 1 time.
Word 115 ("stand") appears 1 time.
Word 121 ("today") appears 1 time.
Word 130 ("accept") appears 1 time.
Word 131 ("accord") appears 1 time.
Word 140 ("awar") appears 3 time.
Word 161 ("decept") appears 1 time.
Word 166 ("duti") appears 2 time.
Word 167 ("earn") appears 1 time.
Word 203 ("look") appears 1 time.
Word 233 ("servant") appears 4 time.
Word 234 ("short") appears 1 time.
Word 240 ("spiritu") appears 1 time.
Word 249 ("thing") appears 1 time.
Word 254 ("trust") appears 
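
To make the bag-of-words idea concrete, here is a minimal pure-Python sketch of what a `doc2bow`-style conversion does: map each token to an integer id and count how often each id occurs in a document. The toy documents, `token2id`, and `to_bow` are illustrative names that are not part of the notebook, and the ids gensim assigns to real tokens will generally differ.

```python
from collections import Counter

# Hypothetical toy corpus: each document is already tokenized and stemmed.
toy_docs = [
    ["allah", "love", "servant", "love"],
    ["nation", "homeland", "love"],
]

# Assign an integer id to every distinct token, in order of first appearance.
token2id = {}
for doc in toy_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc, vocabulary):
    """Return sorted (token_id, count) pairs, skipping tokens outside the vocabulary."""
    counts = Counter(vocabulary[t] for t in doc if t in vocabulary)
    return sorted(counts.items())


toy_bow = [to_bow(doc, token2id) for doc in toy_docs]
print(toy_bow[0])  # [(0, 1), (1, 2), (2, 1)]
```

Word order is discarded: only the pair (token id, count) survives, which is all that LDA needs as input.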

In [14]:
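def findK(corpus=bow_corpus, num_topics=range(5, 21), id2word=dictionary):
    """Fit one LDA model per candidate k in num_topics and print its topics."""
    for k in num_topics:
        # Use the arguments rather than the globals so the function works on any corpus
        lda_model = gensim.models.LdaMulticore(corpus, num_topics=k, id2word=id2word,
                                               passes=20, workers=2, random_state=57)
        print("\033[1m" + "Number of topics: " + str(k) + "\033[0;0m")
        for idx, topic in lda_model.print_topics(-1):
            print('Topic: {} \nWords: {}'.format(idx + 1, topic))

    # Only the model for the last k tried is returned
    return lda_model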

## Deciding number of topics:

After initial trials, I found that a reasonable range for k (the number of topics) is between 5 and 20. I then ran LDA for each value from 5 to 20 and compared the resulting topic equations (the weighted keyword lists). For presentation purposes, I include only the outputs for k between 6 and 10 here. If you look closely at the topic that represents nationalism and how it changes as k increases, you will notice three important things:

- With k = 6, one topic includes many nationalism keywords but also some major religious keywords.
- With k = 10, more than one topic represents nationalism, and the categories are saturated with a number of additional religious topics.
- With k = 8, we reach the optimum point, where the topics are very distinctive and the nationalism topic includes very important keywords like "jihad" and "canakkale"; the latter refers to one of the most important battles in modern Turkish history.
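
The choice of k above rests on reading the topics. As a complementary check, a coherence score can be computed for each candidate model. The sketch below implements a UMass-style coherence in plain Python: for every pair of top words it adds log((D(w_i, w_j) + 1) / D(w_i)), where D counts the documents containing the given words and w_i is the higher-ranked word. It is an illustration under stated assumptions, not the procedure used in this notebook; the function names and toy documents are hypothetical, and for real work gensim's `CoherenceModel` would likely be the more convenient route.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs):
    """UMass-style coherence of one topic.

    top_words: the topic's words, most probable first.
    docs: tokenized documents (lists of tokens).
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    # combinations() keeps input order, so `earlier` always ranks above `later`.
    for earlier, later in combinations(top_words, 2):
        if doc_freq(earlier) == 0:
            continue  # skip words that never occur, to avoid dividing by zero
        score += math.log((doc_freq(later, earlier) + 1) / doc_freq(earlier))
    return score


# Hypothetical toy corpus, not taken from the sermon data.
toy_docs = [
    ["nation", "homeland", "martyr"],
    ["nation", "homeland"],
    ["ramadan", "fast"],
    ["ramadan", "fast", "month"],
]

coherent = umass_coherence(["nation", "homeland"], toy_docs)
mixed = umass_coherence(["nation", "ramadan"], toy_docs)
print(coherent > mixed)  # True
```

Averaging this score over the topics of each model gives one number per k. Scores closer to zero indicate top words that co-occur more often, which can support, but should not replace, the close reading described above.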

In [15]:
findK(num_topics=range(6,11,1))

Number of topics: 6
Topic: 1 
Words: 0.012*"night" + 0.012*"forgiv" + 0.012*"repent" + 0.010*"deed" + 0.010*"truth" + 0.010*"servant" + 0.008*"usuri" + 0.008*"trade" + 0.008*"pbuh" + 0.008*"sin"
Topic: 2 
Words: 0.029*"month" + 0.028*"ramadan" + 0.016*"sacrific" + 0.011*"fast" + 0.011*"reach" + 0.010*"share" + 0.008*"gratitud" + 0.007*"year" + 0.007*"night" + 0.007*"need"
Topic: 3 
Words: 0.033*"mosqu" + 0.025*"children" + 0.024*"famili" + 0.020*"knowledg" + 0.010*"teach" + 0.010*"trust" + 0.009*"place" + 0.008*"week" + 0.008*"build" + 0.007*"companion"
Topic: 4 
Words: 0.012*"evil" + 0.010*"deed" + 0.009*"word" + 0.008*"person" + 0.007*"seek" + 0.006*"iman" + 0.005*"need" + 0.005*"help" + 0.005*"leav" + 0.005*"respect"
Topic: 5 
Words: 0.012*"sunnah" + 0.012*"famili" + 0.011*"religion" + 0.007*"respect" + 0.007*"word" + 0.007*"children" + 0.007*"marriag" + 0.006*"person" + 0.006*"societi" + 0.006*"book"
Topic: 6 
Words: 0.020*"nation" + 0.013*"oppress" + 0.011*"friday" + 0.0

<gensim.models.ldamulticore.LdaMulticore at 0x1a234664d0>

## LDA Model with 8 topics:

This is the equation of our model: each topic is a weighted sum of keywords. Later, an interactive visualization presents everything in a more user-friendly way. Note that I set _passes=20_ because I have very few documents and I want the model to pass over the whole corpus 20 times during training.

In [16]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=8, id2word=dictionary,
                                       passes=20, workers=2, random_state=57)

for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx+1, topic))

Topic: 1 
Words: 0.015*"forgiv" + 0.014*"repent" + 0.012*"truth" + 0.012*"night" + 0.011*"usuri" + 0.011*"trade" + 0.010*"servant" + 0.010*"sin" + 0.008*"deed" + 0.008*"earn"
Topic: 2 
Words: 0.033*"month" + 0.030*"ramadan" + 0.015*"sacrific" + 0.012*"fast" + 0.011*"reach" + 0.009*"night" + 0.009*"gratitud" + 0.009*"share" + 0.008*"year" + 0.006*"zakat"
Topic: 3 
Words: 0.048*"mosqu" + 0.030*"knowledg" + 0.023*"children" + 0.013*"teach" + 0.012*"place" + 0.011*"civil" + 0.010*"book" + 0.009*"learn" + 0.009*"read" + 0.008*"build"
Topic: 4 
Words: 0.013*"evil" + 0.011*"deed" + 0.008*"iman" + 0.008*"word" + 0.007*"seek" + 0.006*"person" + 0.006*"need" + 0.006*"help" + 0.006*"sincer" + 0.005*"best"
Topic: 5 
Words: 0.019*"sunnah" + 0.015*"religion" + 0.013*"word" + 0.010*"book" + 0.009*"truth" + 0.009*"forget" + 0.007*"lie" + 0.007*"command" + 0.006*"enjoin" + 0.006*"harm"
Topic: 6 
Words: 0.016*"friday" + 0.015*"oppress" + 0.013*"children" + 0.010*"violenc" + 0.008*"masjid" + 0.007*"khutb

In [17]:
def format_topics_sentences(ldamodel, corpus, texts):
    """Return one row per document: dominant topic, its share, its keywords, and the text."""
    rows = []

    # Get the main topic of each document
    for row_list in ldamodel[corpus]:
        # With per_word_topics=True the model returns a tuple whose first element is the topic mixture
        row = row_list[0] if ldamodel.per_word_topics else row_list
        # Dominant topic = the (topic_id, probability) pair with the largest probability
        topic_num, prop_topic = max(row, key=lambda x: x[1])
        topic_keywords = ", ".join(word for word, prop in ldamodel.show_topic(topic_num))
        rows.append([int(topic_num), round(float(prop_topic), 4), topic_keywords])

    # Build the frame in one step (DataFrame.append was removed in pandas 2.0)
    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add the tokenized text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df

In [18]:
df_topic_sents_keywords = format_topics_sentences(ldamodel=lda_model, corpus=bow_corpus, texts=processed_docs)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']
df_dominant_topic.head(10)

Unnamed: 0,Document_No,Dominant_Topic,Topic_Perc_Contrib,Keywords,Text
0,0,5.0,0.5447,"friday, oppress, children, violenc, masjid, kh...","[rifq, gentl, allah, love, gracious, affair, h..."
1,1,0.0,0.5938,"forgiv, repent, truth, night, usuri, trade, se...","[ethic, commerc, islam, jumu, mubarak, dear, b..."
2,2,1.0,0.6066,"month, ramadan, sacrific, fast, reach, night, ...","[life, contempl, honor, muslim, vers, recit, a..."
3,3,1.0,0.9962,"month, ramadan, sacrific, fast, reach, night, ...","[sacrific, search, close, allah, honor, believ..."
4,4,6.0,0.9966,"nation, homeland, martyr, victori, caus, bodi,...","[soldier, mehmetcik, prayer, dear, muslim, ble..."
5,5,2.0,0.5535,"mosqu, knowledg, children, teach, place, civil...","[messag, revel, human, read, jumu, mubarak, ho..."
6,6,3.0,0.4692,"evil, deed, iman, word, seek, person, need, he...","[funer, etiquett, duti, travel, etern, life, j..."
7,7,3.0,0.6251,"evil, deed, iman, word, seek, person, need, he...","[worship, spiritu, world, jumu, mubarak, honor..."
8,8,7.0,0.6545,"famili, trust, marriag, respect, relat, greet,...","[preserv, famili, environ, righteous, peac, fu..."
9,9,6.0,0.5493,"nation, homeland, martyr, victori, caus, bodi,...","[juli, rebirth, nation, jumu, mubarak, honor, ..."


The classification of each sermon by its dominant topic, along with the relevant words in the text, is shown below:

In [27]:
df_dominant_topic.head(30)

Unnamed: 0,Document_No,Dominant_Topic,Topic_Perc_Contrib,Keywords,Text
0,0,5.0,0.5447,"friday, oppress, children, violenc, masjid, kh...","[rifq, gentl, allah, love, gracious, affair, h..."
1,1,0.0,0.5938,"forgiv, repent, truth, night, usuri, trade, se...","[ethic, commerc, islam, jumu, mubarak, dear, b..."
2,2,1.0,0.6066,"month, ramadan, sacrific, fast, reach, night, ...","[life, contempl, honor, muslim, vers, recit, a..."
3,3,1.0,0.9962,"month, ramadan, sacrific, fast, reach, night, ...","[sacrific, search, close, allah, honor, believ..."
4,4,6.0,0.9966,"nation, homeland, martyr, victori, caus, bodi,...","[soldier, mehmetcik, prayer, dear, muslim, ble..."
5,5,2.0,0.5535,"mosqu, knowledg, children, teach, place, civil...","[messag, revel, human, read, jumu, mubarak, ho..."
6,6,3.0,0.4692,"evil, deed, iman, word, seek, person, need, he...","[funer, etiquett, duti, travel, etern, life, j..."
7,7,3.0,0.6251,"evil, deed, iman, word, seek, person, need, he...","[worship, spiritu, world, jumu, mubarak, honor..."
8,8,7.0,0.6545,"famili, trust, marriag, respect, relat, greet,...","[preserv, famili, environ, righteous, peac, fu..."
9,9,6.0,0.5493,"nation, homeland, martyr, victori, caus, bodi,...","[juli, rebirth, nation, jumu, mubarak, honor, ..."
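
The dominant-topic view keeps only the largest entry of each sermon's topic mixture. The sketch below shows how much can be hidden by that choice, using the mixture of the first sermon (document 0) as reported in the notebook output, rounded to four decimals. `top_two_topics` is an illustrative helper that is not part of the notebook.

```python
def top_two_topics(topic_probs):
    """Return the two most probable (topic_id, probability) pairs, best first."""
    ranked = sorted(topic_probs, key=lambda pair: pair[1], reverse=True)
    return ranked[0], ranked[1]


# Topic mixture of document 0, copied from the notebook output and rounded.
sermon_0 = [(0, 0.0007), (1, 0.0007), (2, 0.0007), (3, 0.4511),
            (4, 0.0007), (5, 0.5447), (6, 0.0007), (7, 0.0007)]

best, runner_up = top_two_topics(sermon_0)
margin = best[1] - runner_up[1]
print(best, runner_up, round(margin, 4))  # (5, 0.5447) (3, 0.4511) 0.0936
```

Here the runner-up topic accounts for about 45% of the sermon, so a table of dominant topics alone would drop nearly half of its content. This is one reason to work with the full topic distribution of each document.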


In [19]:
# Newer pyLDAvis releases (3.x) renamed this module to pyLDAvis.gensim_models
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

In [20]:
vis = pyLDAvis.gensim.prepare(lda_model, bow_corpus, dictionary=lda_model.id2word)



### Visualizing LDA model:

This is Mako's favorite visualization of topic models. It is a lot easier than it seems. 

In [21]:
vis

In [22]:
#pyLDAvis.show(vis)


Based on my observations of the topics and a closer reading of the relevant texts, I have named the 8 topics as follows: 

- belief: abstract concepts about iman (belief), the existence of evil, and the meaning of being a Muslim.
- praying: concrete discussions about proper praying and the code of conduct for Friday prayers.
- holy months: includes Ramadan, sacrifice, fasting, and suggestions for praying in the holy months.
- salvation: includes the forgiveness and mercifulness of God and how to properly ask for salvation.
- family: suggestions for marriage, solving family problems, and gender roles.
- nationalism: references to nation and history, the sacredness of the homeland and martyrs, and specific contexts about jihad and historic wars.
- lifestyle: sunnah stands for the life practices of the Prophet as an exemplar of a proper religious lifestyle. Also includes words like rules and haram (things not permitted by God).
- education: explains the importance of the summer schools held in mosques to provide religious education to children.


In [23]:
# Full topic distribution for every sermon: {doc_num: {topic_id: probability}}
topics_for_docs = {}
for doc_num, document in enumerate(lda_model.get_document_topics(bow_corpus, minimum_probability=0)):
    topics_for_docs[doc_num] = {str(x[0]): x[1] for x in document}

topics_for_docs


{0: {'0': 0.000703492,
  '1': 0.00070314313,
  '2': 0.00070299604,
  '3': 0.4510726,
  '4': 0.0007035879,
  '5': 0.54470706,
  '6': 0.00070349016,
  '7': 0.0007036025},
 1: {'0': 0.5938339,
  '1': 0.0005587598,
  '2': 0.0005593166,
  '3': 0.14997645,
  '4': 0.17598717,
  '5': 0.00055895234,
  '6': 0.077966355,
  '7': 0.0005591038},
 2: {'0': 0.0507477,
  '1': 0.6066169,
  '2': 0.0005008592,
  '3': 0.00050098606,
  '4': 0.00050097733,
  '5': 0.16924465,
  '6': 0.0005009576,
  '7': 0.17138699},
 3: {'0': 0.0005465769,
  '1': 0.9961741,
  '2': 0.0005464607,
  '3': 0.00054679974,
  '4': 0.0005466899,
  '5': 0.0005465209,
  '6': 0.00054644677,
  '7': 0.0005464076},
 4: {'0': 0.0004890686,
  '1': 0.0004891552,
  '2': 0.0004887306,
  '3': 0.00048917806,
  '4': 0.00048880413,
  '5': 0.00048907916,
  '6': 0.99657696,
  '7': 0.000489005},
 5: {'0': 0.00049883645,
  '1': 0.00049871416,
  '2': 0.55354005,
  '3': 0.00049900456,
  '4': 0.44346684,
  '5': 0.0004989096,
  '6': 0.0004988492,
  '7': 0.0

In [24]:
topics_df = pd.DataFrame(topics_for_docs).transpose()
# Column k holds gensim topic id k, i.e. "Topic k+1" in the printed model above, so the
# labels follow topic-id order (e.g. id 6, "nation, homeland, martyr", is nationalism)
topics_df.columns = ['salvation', 'holy months', 'education', 'belief', 'lifestyle', 'praying', 'nationalism', 'family']
topics_df.insert(0, 'date', documents.date)
topics_df.date = pd.to_datetime(topics_df.date)
topics_df

Unnamed: 0,date,salvation,holy months,education,belief,lifestyle,praying,nationalism,family
0,2019-02-01,0.000703,0.000703,0.000703,0.451073,0.000704,0.544707,0.000703,0.000704
1,2018-03-02,0.593834,0.000559,0.000559,0.149976,0.175987,0.000559,0.077966,0.000559
2,2019-12-27,0.050748,0.606617,0.000501,0.000501,0.000501,0.169245,0.000501,0.171387
3,2017-08-25,0.000547,0.996174,0.000546,0.000547,0.000547,0.000547,0.000546,0.000546
4,2019-10-25,0.000489,0.000489,0.000489,0.000489,0.000489,0.000489,0.996577,0.000489
...,...,...,...,...,...,...,...,...,...
152,2019-04-12,0.000670,0.000669,0.000669,0.995313,0.000670,0.000669,0.000670,0.000669
153,2017-09-01,0.595874,0.000646,0.000645,0.221821,0.118348,0.061376,0.000645,0.000646
154,2017-02-03,0.000574,0.000574,0.000574,0.000574,0.169881,0.142938,0.000574,0.684310
155,2019-08-23,0.931653,0.000485,0.000485,0.065435,0.000485,0.000485,0.000485,0.000485


## Final Output: 

As I stated earlier, I am interested in how the prevalence of certain topics changes over time rather than in the dominant topic of each text. I have therefore created the following table. Each row represents one sermon, with the date of the sermon and the probability that the sermon belongs to the topic in each column. I will use this table to graph a time series for each topic.

In [25]:
topics_df

Unnamed: 0,date,salvation,holy months,education,belief,lifestyle,praying,nationalism,family
0,2019-02-01,0.000703,0.000703,0.000703,0.451073,0.000704,0.544707,0.000703,0.000704
1,2018-03-02,0.593834,0.000559,0.000559,0.149976,0.175987,0.000559,0.077966,0.000559
2,2019-12-27,0.050748,0.606617,0.000501,0.000501,0.000501,0.169245,0.000501,0.171387
3,2017-08-25,0.000547,0.996174,0.000546,0.000547,0.000547,0.000547,0.000546,0.000546
4,2019-10-25,0.000489,0.000489,0.000489,0.000489,0.000489,0.000489,0.996577,0.000489
...,...,...,...,...,...,...,...,...,...
152,2019-04-12,0.000670,0.000669,0.000669,0.995313,0.000670,0.000669,0.000670,0.000669
153,2017-09-01,0.595874,0.000646,0.000645,0.221821,0.118348,0.061376,0.000645,0.000646
154,2017-02-03,0.000574,0.000574,0.000574,0.000574,0.169881,0.142938,0.000574,0.684310
155,2019-08-23,0.931653,0.000485,0.000485,0.065435,0.000485,0.000485,0.000485,0.000485


In [26]:
topics_df.to_csv('topics_csv', sep=';', index = False, header=True)

## Prevalence of topics over time: 

The graphs are area graphs. This is probably not the best way to visualize the data, since there are many zero points over the years. 

![graph](Rplot06.png "nationalism")
![graph](Rplot01.png "belief")
![graph](Rplot02.png "praying")
![graph](Rplot03.png "holy months")
![graph](Rplot04.png "salvation")
![graph](Rplot05.png "family")
![graph](Rplot07.png "lifestyle")
![graph](Rplot08.png "education")
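
Because most weekly values are close to zero, one option is to average each topic by calendar month before plotting, and to summarize the direction of change with a least-squares slope, which is also one simple way to operationalize H2. The sketch below is a minimal pure-Python illustration, assuming the input is a list of (date, prevalence) pairs for a single topic. The helper names and the toy values are hypothetical and are not taken from the dataset. Months without sermons are skipped, and the slope treats the remaining months as equally spaced.

```python
from collections import defaultdict
from datetime import date


def monthly_means(rows):
    """Average a topic's prevalence per calendar month.

    rows: iterable of (date, prevalence) pairs, one per sermon.
    Returns a chronologically sorted list of ((year, month), mean).
    """
    buckets = defaultdict(list)
    for day, value in rows:
        buckets[(day.year, day.month)].append(value)
    return [(month, sum(vals) / len(vals)) for month, vals in sorted(buckets.items())]


def ols_slope(ys):
    """Least-squares slope of ys against their positions 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two points to estimate a slope")
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


# Hypothetical (date, nationalism prevalence) pairs, not values from the dataset.
toy_rows = [
    (date(2019, 1, 4), 0.0),
    (date(2019, 1, 11), 0.2),
    (date(2019, 2, 1), 0.3),
    (date(2019, 3, 1), 0.5),
]

series = monthly_means(toy_rows)
trend = ols_slope([mean for _, mean in series])
print(series)
print(round(trend, 3))
```

A positive slope is consistent with H2, but a proper test would also need an uncertainty estimate (for example a standard error on the slope), which this sketch leaves out.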


### Future work: 

- Compare the Turkish and English LDA models. This is important for increasing the data size.
- Use the time series plots to do close readings.

