# The Light in French Literature 2 - keyword analysis
## - The use of lighting technologies and sensations
Which writers make the most and the least use of sensational words when writing about lighting technologies?

- How likely is it that a sentence mentioning a lighting technology also contains “sensational vocabulary”?
- How many lighting technology sentences are there?
- How many of them contain sensational vocabulary?

    1. split the text into sentences
    2. count the sentences
    3. count the sentences with lighting technology
    4. count the sentences with lighting technology containing “sensational vocabulary”
    5. calculate the share of all sentences that mention lighting technology and contain “sensational vocabulary”
    6. calculate the share of lighting technology sentences that contain “sensational vocabulary”
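
The six steps above can be sketched on a toy text. This is only an illustration under stated assumptions: the two keyword lists and the sample text are made up, and sentences are split naively on full stops.

```python
import re

# Hypothetical keyword lists and sample text, for illustration only
tech_words = ['lampe', 'lustre']
sensation_words = ['chaud', 'doux']
text = "La lampe brûle. Le lustre jette un doux éclat. Il pleut. Elle dort."

# Steps 1-2: split into sentences (naively, on full stops) and count them
sentences = [s.strip().lower() for s in text.split('.') if s.strip()]
n_sentences = len(sentences)

# Step 3: sentences mentioning a lighting technology
tech_sentences = [s for s in sentences
                  if any(re.search(w, s) for w in tech_words)]

# Step 4: of those, sentences that also contain sensation vocabulary
tech_sens_sentences = [s for s in tech_sentences
                       if any(re.search(w, s) for w in sensation_words)]

# Steps 5-6: the two ratios (NaN if there are no technology sentences at all)
share_of_all = len(tech_sens_sentences) / n_sentences
share_of_tech = (len(tech_sens_sentences) / len(tech_sentences)
                 if tech_sentences else float('nan'))

print(n_sentences, len(tech_sentences), len(tech_sens_sentences))
print(share_of_all, share_of_tech)
```

In this toy text, two of the four sentences mention a lighting technology and one of those two also contains a sensation word.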
    



In [None]:
import re
import os
import pandas as pd
from pathlib import Path
from tqdm.notebook import tqdm
tqdm.pandas()

# Prepare data
- Data is stored in a dataframe
- A function detects the presence of keywords in sentences (the column names below call these units paragraphs). The text is split into sentences on full stops, and each sentence is checked against the keyword list one keyword at a time. As soon as one keyword is found, we stop searching that sentence: we just register that it contains a keyword and move on to the next sentence.
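
The early stop described above can be sketched as follows. The function names `has_keyword` and `count_keyword_sentences` and the sample text are hypothetical and only illustrate the idea; `any()` stops at the first keyword that matches, so a sentence with several keywords is still counted once.

```python
import re

def has_keyword(sentence, key_words):
    """Return True as soon as one keyword matches; later keywords are not tested."""
    return any(re.search(kw, sentence) for kw in key_words)

def count_keyword_sentences(text, key_words):
    """Count the sentences (split on full stops) that contain at least one keyword."""
    sentences = [s.strip() for s in text.lower().split('.') if s.strip()]
    return sum(has_keyword(s, key_words) for s in sentences)

# Made-up sample: the first sentence holds two keywords but is counted once
sample = "La lampe et le lustre brillent. Il fait nuit. Une bougie tremble."
n = count_keyword_sentences(sample, ['lampe', 'lustre', 'bougie'])
print(n)
```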

In [None]:
input_dir = Path.cwd() / '../data/csv_files' # path of files to be found
csv_files = os.listdir(input_dir)
print (f'The file name is {csv_files[-1]}')
df = pd.read_csv(input_dir / csv_files[-1], sep='|')



# split the text into sentences (on full stops) and count them
def count_sentences(text):
    # drop empty fragments, e.g. the one left after the final full stop
    sent = [s.strip() for s in text.split('.') if s.strip()]
    return len(sent)

print ('\n\nCount sentences:\n')
df['count_paragraphs'] = df['text'].progress_apply(count_sentences)


#### Read two lists of keywords - technology and emotion keywords ####

# path to keyword lists directory 
input_dir = Path.cwd() / '../data/key_word_lists' 


# get the tech words
# (blank lines are skipped: an empty keyword would match every sentence)
key_word_file_name = 'technology_list.txt'
with open(input_dir / key_word_file_name, 'r', encoding='utf-8-sig') as file:
    tech_key_words = [w for w in file.read().split('\n') if w.strip()]
    
# get the emo words    
key_word_file_name = 'sensation_list.txt'
with open(input_dir / key_word_file_name, 'r', encoding='utf-8-sig') as file:
    emo_key_words = [w for w in file.read().split('\n') if w.strip()]



    
### Build function to count the sentences that contain a keyword
def count_kw(text, key_word_list):
    text = text.lower()
    sentences = text.split('.')
    list_of_sents = [s.strip() for s in sentences]
    
    count = 0
    # take sentences one by one
    for sent in list_of_sents:
        # take every word in the list of keywords
        for key_word in key_word_list:
            # if the keyword (used as a regex pattern) is found in the sentence
            if re.search(key_word, sent):
                # then add one to the counter 
                count = count + 1
                # and stop checking further keywords for this sentence
                break
    return count

def count_kw_in_kw_sent(text, key_word_list1, key_word_list2):
    text = text.lower()
    sentences = text.split('.')
    list_of_sents = [s.strip() for s in sentences]
    
    count = 0
    # take sentences one by one
    for sent in list_of_sents:
        # does the sentence contain a word from the first (tech) list?
        has_kw1 = any(re.search(key_word, sent) for key_word in key_word_list1)
        # does it also contain a word from the second (sensation) list?
        # (plain substring test, not a regex)
        has_kw2 = any(key_word in sent for key_word in key_word_list2)
        # count each sentence at most once, even if it holds several tech words
        if has_kw1 and has_kw2:
            count = count + 1
    return count



######### Count sentences with keywords ##### 

print ('\n\nCount sentences with tech keywords:\n')    
df['p_lightning_kw'] = df['text'].progress_apply(lambda text : count_kw(text, tech_key_words))

######### Count sentences with lighting technology containing “emotion vocabulary”
print ('\n\nCount tech sentences that hold sensation keywords:\n') 
df['p_lightning_emotion_kw'] = df['text'].progress_apply(lambda text : count_kw_in_kw_sent(text, tech_key_words, emo_key_words))


In [None]:
df

In [None]:
# get relative data
new_df = df.copy()
new_df['lightning technology paragraphs share of all paragraphs'] = new_df['p_lightning_kw'] / new_df['count_paragraphs'] 
new_df['emo and lightning technology paragraph share of all paragraphs'] = new_df['p_lightning_emotion_kw'] / new_df['count_paragraphs'] 
new_df['emo and lightning technology paragraph share of lightning technology paragraph'] = new_df['p_lightning_emotion_kw'] / new_df['p_lightning_kw']

new_df = new_df.iloc[:, [0, 1,2,3,6,7,8,9,10]]
new_df

In [None]:
import plotly.express as px

In [None]:
x_var = new_df.index + 1
y_var = 'emo and lightning technology paragraph share of all paragraphs'


fig = px.bar(new_df.sort_values(by=y_var, ascending=False), 
            x = x_var, 
            y = y_var,
            hover_data=['author','title', 'year', 'emo and lightning technology paragraph share of all paragraphs'],
            title= 'Relative frequency of paragraphs')
             
# Update x-axis label
fig.update_xaxes(title_text='Document in chronological order')

# Update y-axis label
fig.update_yaxes(title_text='Relative frequency') 

fig.show()

In [None]:
import plotly.io as io

html_snippet_start = '<!DOCTYPE html> <html> <head> <title>Title</title> </head> <body>' 
html_snippet_end = ' </body></html> '

html_as_string = io.to_html(fig, full_html=False)

vis_in_html = html_snippet_start + html_as_string + html_snippet_end

# write the file inside a with-block so that it is closed (and flushed) afterwards
with open(r'C:\Users\lakj\Lighting in French Literature\visualisations\emo_in_light_para2.htm', 'w', encoding='utf-8-sig') as of:
    of.write(vis_in_html)

# Keyword in context - or find a text snippet based on keywords and a range

We want to find a word, for example 'lumière', together with related words, and we need some context around each hit, because we are interested in going into the text and seeing exactly how lumière is used.

For this we put `.{0,50}` in front of the keyword: `.` matches any character except a line break, and `{0,50}` allows up to 50 of them, so we get up to 50 characters of context before the keyword. `\b` directly in front of the keyword marks a word boundary, so we only find words that begin with the keyword and not words where it is part of a longer word; `\blustre`, for example, finds lustre and lustres but not illustre. After the keyword, `.{0,40}` collects up to 40 characters of following context.

The pipe, `|`, means 'or' and allows us to search for more than one word in the same pattern.
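
As a small illustration of such a pattern, here is a sketch on a made-up sentence. The sample text and the shorter context window of 10 characters are assumptions chosen to keep the output readable.

```python
import re

# Made-up sentence, for illustration only
sample = "Un homme illustre alluma la lampe, puis les lustres du salon."

# Up to 10 characters of context on each side of a word
# that starts with 'lampe' or 'lustre'
pattern = r'.{0,10}\blampe.{0,10}|.{0,10}\blustre.{0,10}'
snippets = re.findall(pattern, sample)

for snippet in snippets:
    print(repr(snippet))
```

Because of the `\b`, the word illustre is not reported as a hit, while lustres is, since it begins with the keyword.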

Below I choose to look for contexts in texts of the author Balzac.

In [None]:
text = ' '.join(df[df['author'] == 'Balzac']['clean_text'])
import re
context = re.findall(r'.{0,50}\blampe.{0,40}|.{0,50}\blustre.{0,40}', text)
context