# The Light in French Literature 2 - keyword analysis
## - The use of lighting technologies and emotions
Which writers make the most and least use of lighting technologies when writing about emotions?

- how likely is it that a lighting technology is mentioned in a paragraph containing “emotion vocabulary”?
- how many lighting technology paragraphs are there?
- how many of them contain emotion vocab?

    1. split text in paragraphs
    2. count paragraphs
    3. count paragraphs with lighting technology
    4. count paragraphs with lighting technology containing “emotion vocabulary”
    5. caculate ratio of total paragraphs and paragraphs with lighting technology containing “emotion vocabulary”
    6. caculate ratio of paragraphs with lighting technology and paragraphs with lighting technology containing “emotion vocabulary”
    



In [100]:
import re
import os
import pandas as pd
from pathlib import Path
from tqdm.notebook import tqdm
tqdm.pandas()

# Prepare data
- Data is stored in a dataframe
- in a function we detect presens of keywords in paragraphs. To do this we split the text into paragraphs and check then one by for a keyword. If a keyword is detected in a paragraph we do not continue to search for other keywords in the same paragraph. We just register, that a keyword is found in a paragraph and moves on to the next paragraph.

In [101]:
input_dir = Path.cwd() / '../data/csv_files' # path of files to be found
csv_files = os.listdir(input_dir)
print (f'The file name is {csv_files[-1]}')
df = pd.read_csv(input_dir / csv_files[-1], sep='|')



# split text in paragraphs and count paragraphs
def count_paragraphs(text):
    paragraphs = text.split('.')
    paragraphs = [p.strip() for p in paragraphs]
    return len(paragraphs)

print ('\n\nCount paragraphs:\n')
df['count_paragraphs'] = df['text'].progress_apply(lambda text : count_paragraphs(text))


#### Read two lists of keywords - technology and emotion keyowrds ####

# path to keyword lists directory 
input_dir = Path.cwd() / '../data/key_word_lists' 


# get the tech words
key_word_file_name = 'technology_list.txt'
with open(input_dir / key_word_file_name, 'r', encoding='utf-8-sig') as file:
    tech_key_words = file.read().split('\n')
    
# get the emo words    
key_word_file_name = 'emotion_list.txt'
with open(input_dir / key_word_file_name, 'r', encoding='utf-8-sig') as file:
    emo_key_words = file.read().split('\n')



    
### Build function to do the teck key word counting
def count_kw(text, key_word_list):
    text = text.lower()
    paragraphs = text.split('.')
    list_of_paragraphs = [p.strip() for p in paragraphs]
    
    count = 0
    # take paragraphs one by one
    for paragraph in list_of_paragraphs:
        # take every word in the list of keywords
        for key_word in key_word_list:
            # if a keyword is in the paragraph return boolean
            if re.search(key_word, paragraph):
                # then add one to the counter 
                count = count + 1
                # and then break out and return to the beginning of the loop 
                break
    return count

def count_kw_in_kw_p(text, key_word_list1, key_word_list2):
    text = text.lower()
    paragraphs = text.split('.')
    list_of_paragraphs = [p.strip() for p in paragraphs]
    
    count = 0
    # take paragraphs one by one
    for paragraph in list_of_paragraphs:
        
        # take every word in the list of tech words
        for key_word in key_word_list1:
            # if a word from the list is in the paragraph
            if re.search(key_word, paragraph):
                for key_word in key_word_list2:
                    if key_word in paragraph:
                        # then add one to the counter 
                        count = count + 1
                        # and then break out and return to the beginning of the loop 
                        break
    return count



######### Count keyword in paragraphs ##### 

print ('\n\nCount paragraphs with tech keywords:\n')    
df['p_lightning_kw'] = df['text'].progress_apply(lambda text : count_kw(text, tech_key_words))

######### Count paragraphs with lighting technology containing “emotion vocabulary”
print ('\n\nCount tech paragraphs that holds emotion keywords:\n') 
df['p_lightning_emotion_kw'] = df['text'].progress_apply(lambda text : count_kw_in_kw_p(text, tech_key_words, emo_key_words))


The file name is text_data230826.csv


Count paragraphs:



  0%|          | 0/101 [00:00<?, ?it/s]



Count paragraphs with tech keywords:



  0%|          | 0/101 [00:00<?, ?it/s]



Count tech paragraphs that holds emotion keywords:



  0%|          | 0/101 [00:00<?, ?it/s]

In [102]:
df

Unnamed: 0,year,author,title,text,clean_text,count_paragraphs,p_lightning_kw,p_lightning_emotion_kw
0,1838,Balzac,Splendeurs et misères des courtisanes 1 Esther...,SPLENDEURS ET MISÈRESDES COURTISANES Laissez...,splendeurs et misèresdes courtisanes laissez-m...,2002,45,18
1,1838,Balzac,Splendeurs et misères des courtisanes 2 A comb...,"Depuis huit jours, Nucingen allait marchander ...",depuis huit jours nucingen allait marchander l...,1870,18,1
2,1838,Balzac,Splendeurs et misères des courtisanes 3 Où mèn...,"Le lendemain, à six heures, deux voitures mené...",le lendemain à six heures deux voitures menées...,1304,24,4
3,1838,Balzac,Splendeurs et misères des courtisanes 4 La der...,"— Qu’y a-t-il, Madeleine? dit madame Camusot e...",qu’y a-t-il madeleine dit madame camusot en vo...,1853,10,3
4,1842,Balzac,La Rabouilleuse,"LES CÉLIBATAIRES. Voici, mon cher Nodier, un o...",les célibataires voici mon cher nodier un ouvr...,4017,45,7
...,...,...,...,...,...,...,...,...
96,1910,Colette,La Vagabonde,LA VAGABONDE \n\n\n\nParu dans Le Livre de Poc...,la vagabonde paru dans le livre de poche l'ing...,6407,64,18
97,1910,DelarueMardrus,Comme tout le monde,Première partie C’est un wagon de deuxième...,première partie c’est un wagon de deuxième cla...,3007,54,14
98,1910,Maupassant,Les Dimanches d'un bourgeois à Paris,"Monsieur Patissot, né à Paris, après avoir fai...",monsieur patissot né à paris après avoir fait ...,1057,21,5
99,1911,Daudet,Rose et Ninette,"au poëte et au philosophe, je dédie cette page...",au poëte et au philosophe je dédie cette page ...,905,25,7


In [68]:
# get relative data
new_df = df.copy()
new_df['lightning technology paragraphs share of all paragraphs'] = new_df['p_lightning_kw'] / new_df['count_paragraphs'] 
new_df['emo and lightning technology paragraph share of all paragraphs'] = new_df['p_lightning_emotion_kw'] / new_df['count_paragraphs'] 
new_df['emo and lightning technology paragraph share of lightning technology paragraph'] = new_df['p_lightning_emotion_kw'] / new_df['p_lightning_kw']

new_df = new_df.iloc[:, [0, 1,2,3,6,7,8,9,10]]
new_df

Unnamed: 0,year,author,title,text,p_lightning_kw,p_lightning_emotion_kw,lightning technology paragraphs share of all paragraphs,emo and lightning technology paragraph share of all paragraphs,emo and lightning technology paragraph share of lightning technology paragraph
0,1838,Balzac,Splendeurs et misères des courtisanes 1 Esther...,SPLENDEURS ET MISÈRESDES COURTISANES Laissez...,45,18,0.022478,0.008991,0.400000
1,1838,Balzac,Splendeurs et misères des courtisanes 2 A comb...,"Depuis huit jours, Nucingen allait marchander ...",18,1,0.009626,0.000535,0.055556
2,1838,Balzac,Splendeurs et misères des courtisanes 3 Où mèn...,"Le lendemain, à six heures, deux voitures mené...",24,4,0.018405,0.003067,0.166667
3,1838,Balzac,Splendeurs et misères des courtisanes 4 La der...,"— Qu’y a-t-il, Madeleine? dit madame Camusot e...",10,3,0.005397,0.001619,0.300000
4,1842,Balzac,La Rabouilleuse,"LES CÉLIBATAIRES. Voici, mon cher Nodier, un o...",45,7,0.011202,0.001743,0.155556
...,...,...,...,...,...,...,...,...,...
96,1910,Colette,La Vagabonde,LA VAGABONDE \n\n\n\nParu dans Le Livre de Poc...,64,18,0.009989,0.002809,0.281250
97,1910,DelarueMardrus,Comme tout le monde,Première partie C’est un wagon de deuxième...,54,14,0.017958,0.004656,0.259259
98,1910,Maupassant,Les Dimanches d'un bourgeois à Paris,"Monsieur Patissot, né à Paris, après avoir fai...",21,5,0.019868,0.004730,0.238095
99,1911,Daudet,Rose et Ninette,"au poëte et au philosophe, je dédie cette page...",25,7,0.027624,0.007735,0.280000


In [69]:
import plotly.express as px

In [97]:
x_var = new_df.index + 1
y_var = 'emo and lightning technology paragraph share of all paragraphs'


fig = px.bar(new_df.sort_values(by=y_var, ascending=False), 
            x = x_var, 
            y = y_var,
            hover_data=['author','title', 'year', 'emo and lightning technology paragraph share of all paragraphs'],
            title= 'Relative frequency of paragraphs')
             
# Update x-axis label
fig.update_xaxes(title_text='Document in chronological order')

# Update y-axis label
fig.update_yaxes(title_text='Relative frequence') 

fig.show()

In [99]:
import plotly.io as io

html_snippet_start = '<!DOCTYPE html> <html> <head> <title>Title</title> </head> <body>' 
html_snippet_end = ' </body></html> '

html_as_string = io.to_html(fig, full_html=False)

vis_in_html = html_snippet_start + html_as_string + html_snippet_end

of = open(r'C:\Users\lakj\Lighting in French Literature\visualisations\emo_in_light_para2.htm', 'w', encoding='utf-8-sig')
of.write(vis_in_html)

3697588

# Keyword in context - or find a text snippet based on keywords and a range

We want to find a word for example 'lumière' as well as words that are related to the word, and we have to have some context because we are actually interested in pointing down the text and seeing exactly how lumière is used.

For this we need to use \w. because it gives us more word characters and {30} checks that we get 30 word characters before we hit the letters lumière. \b in front of lumière searches so that we only find words that begin with lumière and not words where lumière is part of the word, e.g. looking. After lumière, \w.{30} searches for another 30 word characters.

The pipe , | , means 'or' and alow us to get more than one word in the same search.

Below I choose to look for contexts in texts of the author Balzac.

In [None]:
text = ' '.join(df[df['author'] == 'Balzac']['clean_text'])
import re
context = re.findall(r'.{0,50}\blampe.{0,40}|.{0,50}\blustre.{0,40}', text)
context