Claes Pauline. Master Digital Text Analysis. Student ID: 20163274

# Calculating normalized frequencies on text level 


In this file, normalized frequencies of the constructions themselves, as well as of certain context parameters, are calculated at **text level**.

<u>How will I go about this?</u>
To calculate the normalized frequency of an observation, the raw number of observations is divided by the word count of the subcorpus in which the observation is made, and the result is multiplied by 1,000,000. To obtain the normalized frequency of an observation (e.g. the _be going to_-construction) at text level, we treat the text in which it occurs as the 'subcorpus'. In other words, the number of observations in a text is divided by the word count of that text and multiplied by 1,000,000.

For example, suppose we want to count how many times the word 'thesis' occurs in each text in our corpus, given that each row in the dataframe equals one observation:
- group the rows by file name
- count the number of rows for each file name and insert it in a column
- divide the number of rows (i.e. the number of observations) by the word count of that text, and multiply by 1,000,000
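
The three steps above can be sketched on a small invented table. The file names, word counts and the column name `norm_freq` are hypothetical and only serve to illustrate the arithmetic; this is not the thesis data.

```python
import pandas as pd

# Hypothetical concordance table: one row per observation of 'thesis'.
hits = pd.DataFrame({
    "filename": ["a.xml", "a.xml", "b.xml"],
    "wordcount": [50_000, 50_000, 200_000],
})

# Steps 1 and 2: group by file name and count the rows per file,
# keeping the word count of each text alongside the raw count.
per_text = (
    hits.groupby("filename")
        .agg(raw_n=("wordcount", "size"), wordcount=("wordcount", "first"))
        .reset_index()
)

# Step 3: divide by the word count of the text and scale to one million words.
per_text["norm_freq"] = per_text["raw_n"] / per_text["wordcount"] * 1_000_000
print(per_text)
```

Two hits in a 50,000-word text give a higher normalized frequency than one hit in a 200,000-word text, which is exactly the comparison that raw counts alone would obscure.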


In [1]:
import pandas as pd
import numpy as np

In [3]:
temp_en = pd.read_excel('/Users/paulineclaes/Documents/dta/thesis/ClaesPauline_thesis_finaleversie/ClaesPauline_data/ClaesPauline_final_GoToInf.xlsx')

In [4]:
temp_en.head()

Unnamed: 0,period,period_category,data_identifier,timeframe,author_id,text_id_per_author,authorId_textId,filename,title,author,...,position,position_fronting,argmt,adverbial,split,attention?,preceding_context,match,following_context,note
0,1600-1609,3,ET13,early,11,1,11_1,A05339.xml,Noua Francia : or The description of that part...,"Erondelle, P.",...,,,,,,,", neere about the Açors, well fil led with ...",going,a fiſhing for New-found-land-fiſh. And they as...,
1,1600-1609,3,ET09,early,8,1,8_1,A01991.xml,Admirable and memorable histories containing t...,"Grimeston, Edward",...,,,,,,,", by reason of the greatnesse and length. Thi...",going,"a iourney with his Wa gon, was be-nighted and ...",
2,1600-1609,3,ET13,early,11,1,11_1,A05339.xml,Noua Francia : or The description of that part...,"Erondelle, P.",...,,,,,,,"hogſ heads of Meale, which were giuen to the...",going,a way. The eleuenth of Auguſt the ſaid Monſi...,
3,1600-1609,3,ET14,early,12,1,12_1,99850354.xml,"Fovvre bookes, of the institution, vse and doc...",anon10,...,,,,,,,": that is, of a s...",going,"about it seauen times, ...",
4,1600-1609,3,ET14,early,12,1,12_1,99850354.xml,"Fovvre bookes, of the institution, vse and doc...",anon10,...,,,,,,,impotencie weaknesse ...,going,"about to adore the Head in heauen, ...",


In [5]:
# check the unique values
temp_en.cxn.unique()

array(['noise', '[BE about Ving]+[GO to V]', '[free-adjunct]+[GO to V]',
       '[BE Ving]+[GO to V]', '[GO to V]'], dtype=object)

In [6]:
# remove noise and the '[BE about Ving]+[GO to V]' construction
en = temp_en.drop(temp_en[(temp_en.cxn == 'noise') | (temp_en.cxn == '[BE about Ving]+[GO to V]')].index)

In [7]:
# extracting progressive and nonProgressive

en_progr = en.drop(en[(en.aspect != 'progressive')].index)
en_nonprogr = en.drop(en[(en.aspect != 'nonProgressive')].index)
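
The two `drop` calls above remove every row that does not have the wanted `aspect` value. As a minimal sketch, assuming a dataframe with a unique index, the same selection can be written as a positive boolean mask. The rows below are invented; only the column name `aspect` and its two values are taken from the notebook.

```python
import pandas as pd

# Hypothetical toy rows; only the column name and its values mirror the notebook.
toy = pd.DataFrame({
    "aspect": ["progressive", "nonProgressive", "progressive"],
    "match": ["going", "go", "going"],
})

# Notebook style: drop every row whose aspect is NOT 'progressive'.
dropped = toy.drop(toy[toy.aspect != "progressive"].index)

# Equivalent positive selection: keep every row whose aspect IS 'progressive'.
masked = toy[toy.aspect == "progressive"]

print(dropped.equals(masked))
```

Both forms keep the original index labels, so later steps that rely on the index behave the same either way.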

In [8]:
en.cxn.unique()

array(['[free-adjunct]+[GO to V]', '[BE Ving]+[GO to V]', '[GO to V]'],
      dtype=object)

In [9]:
en_progr.cxn.unique()

array(['[free-adjunct]+[GO to V]', '[BE Ving]+[GO to V]'], dtype=object)

In [10]:
en_nonprogr.cxn.unique()

array(['[GO to V]'], dtype=object)

In [11]:
# constructing a dictionary with unique authorId_textIds and their number of concordances (both go-constructions)
rows_dict_both = {key:len(value) for key, value in en.groupby('authorId_textId').groups.items()}

# constructing a dictionary with unique authorId_textIds and their number of PROGRESSIVE concordances
rows_dict_progr = {key:len(value) for key, value in en_progr.groupby('authorId_textId').groups.items()}

# constructing a dictionary with unique authorId_textIds and their number of NONPROGRESSIVE concordances
rows_dict_nonprogr = {key:len(value) for key, value in en_nonprogr.groupby('authorId_textId').groups.items()}
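
The dictionary comprehensions above count the rows of each group through `.groups`. A shorter formulation that should give the same dictionaries is `groupby(...).size()`, sketched here on invented identifiers:

```python
import pandas as pd

# Hypothetical concordance rows, one per observation.
toy = pd.DataFrame({"authorId_textId": ["11_1", "8_1", "11_1", "12_1", "12_1"]})

# Notebook style: length of each group's index.
via_groups = {key: len(value) for key, value in toy.groupby("authorId_textId").groups.items()}

# Alternative: let pandas count the rows per group directly.
via_size = toy.groupby("authorId_textId").size().to_dict()

print(via_groups == via_size)
```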

In [12]:
en.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 458 entries, 19 to 1493
Data columns (total 41 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   period                       458 non-null    object
 1   period_category              458 non-null    int64 
 2   data_identifier              458 non-null    object
 3   timeframe                    458 non-null    object
 4   author_id                    458 non-null    int64 
 5   text_id_per_author           458 non-null    int64 
 6   authorId_textId              458 non-null    object
 7   filename                     458 non-null    object
 8   title                        458 non-null    object
 9   author                       457 non-null    object
 10  textDate                     458 non-null    int64 
 11  wordcount                    458 non-null    object
 12  USTC_subject_classification  458 non-null    object
 13  kind                         458 

In [13]:
# selecting only interesting columns
temp_new_df = en[['author', 'author_id', 'text_id_per_author', 'authorId_textId', 'textDate','timeframe', 'kind', 'USTC_subject_classification','period', 'period_category', 'wordcount']]

In [14]:
# keeping only unique authorId_textIds
new_df_en = temp_new_df.drop_duplicates(subset=['authorId_textId'])

In [15]:
len(new_df_en) == len(new_df_en.authorId_textId.unique())

True

In [16]:
new_df_en.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 52 entries, 19 to 1489
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   author                       52 non-null     object
 1   author_id                    52 non-null     int64 
 2   text_id_per_author           52 non-null     int64 
 3   authorId_textId              52 non-null     object
 4   textDate                     52 non-null     int64 
 5   timeframe                    52 non-null     object
 6   kind                         52 non-null     object
 7   USTC_subject_classification  52 non-null     object
 8   period                       52 non-null     object
 9   period_category              52 non-null     int64 
 10  wordcount                    52 non-null     object
dtypes: int64(4), object(7)
memory usage: 4.9+ KB


In [16]:
# inserting column with raw number of instances per authorId_textId of BOTH constructions
new_df_en.insert(11, 'both', new_df_en['authorId_textId'].map(rows_dict_both))

In [17]:
# inserting column with raw number of instances per authorId_textId of PROGRESSIVE construction
new_df_en.insert(12, 'progr', new_df_en['authorId_textId'].map(rows_dict_progr))

In [18]:
# inserting column with raw number of instances per authorId_textId of NONPROGRESSIVE construction
new_df_en.insert(13, 'nonprogr', new_df_en['authorId_textId'].map(rows_dict_nonprogr))

In [19]:
new_df_en = new_df_en.fillna(0)
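
Why the `fillna(0)` step is needed: a text that has no observation in a given subset is missing from that subset's dictionary, so `map` returns `NaN` for it. This is presumably also why such count columns end up with a float dtype. A small sketch with invented counts:

```python
import pandas as pd

# Hypothetical per-text table and a count dictionary that lacks one text.
texts = pd.DataFrame({"authorId_textId": ["11_1", "8_1", "12_1"]})
progr_counts = {"11_1": 3, "12_1": 1}  # "8_1" has no progressive observations

# map() looks up each identifier; identifiers without an entry become NaN.
mapped = texts["authorId_textId"].map(progr_counts)

# Replace the missing counts by 0, as in the notebook, and restore integers.
filled = mapped.fillna(0).astype(int)

print(mapped.tolist(), filled.tolist())
```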

In [20]:
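def calculator(df, condition, grouping_object='authorId_textId'):
    """Count the raw number of observations per unique text."""
    # `condition` is a negative condition: rows for which it is True are dropped,
    # so (df.motion != 'yes') counts the rows where motion IS 'yes'.
    new_df = df.drop(df[condition].index)
    rows_dict = {key:len(value) for key, value in new_df.groupby(grouping_object).groups.items()}
    return rows_dict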
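
To make the negative condition of `calculator` concrete, here is a small usage sketch on invented rows. The function body is copied from the cell above; the data are hypothetical.

```python
import pandas as pd

def calculator(df, condition, grouping_object='authorId_textId'):
    # Drop the rows matching the (negative) condition, then count rows per text.
    new_df = df.drop(df[condition].index)
    return {key: len(value) for key, value in new_df.groupby(grouping_object).groups.items()}

# Hypothetical observations for two texts.
toy = pd.DataFrame({
    'authorId_textId': ['11_1', '11_1', '8_1', '8_1', '8_1'],
    'motion': ['yes', 'no', 'yes', 'yes', 'no'],
})

# Passing "motion is NOT yes" leaves only the 'yes' rows to be counted.
motion_yes = calculator(toy, toy.motion != 'yes')
motion_no = calculator(toy, toy.motion != 'no')

print(motion_yes, motion_no)
```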

In [21]:
col_dict = {}

In [22]:
# motion
both_motion_yes = calculator(en, (en.motion != 'yes'))
col_dict['both_motion_yes'] = both_motion_yes

both_motion_no = calculator(en, (en.motion != 'no'))
col_dict['both_motion_no'] = both_motion_no

progr_motion_yes = calculator(en_progr, (en_progr.motion != 'yes'))
col_dict['progr_motion_yes'] = progr_motion_yes 

progr_motion_no = calculator(en_progr, (en_progr.motion != 'no'))
col_dict['progr_motion_no'] = progr_motion_no 

nonprogr_motion_yes = calculator(en_nonprogr, (en_nonprogr.motion != 'yes'))
col_dict['nonprogr_motion_yes'] = nonprogr_motion_yes 

nonprogr_motion_no = calculator(en_nonprogr, (en_nonprogr.motion != 'no'))
col_dict['nonprogr_motion_no'] = nonprogr_motion_no 


In [23]:
# voice
both_voice_active = calculator(en, (en.voice != 'active'))
col_dict['both_voice_active'] = both_voice_active

both_voice_passive = calculator(en, (en.voice != 'passive'))
col_dict['both_voice_passive'] = both_voice_passive

progr_voice_active = calculator(en_progr, (en_progr.voice != 'active'))
col_dict['progr_voice_active'] = progr_voice_active

progr_voice_passive = calculator(en_progr, (en_progr.voice != 'passive'))
col_dict['progr_voice_passive'] = progr_voice_passive

nonprogr_voice_active = calculator(en_nonprogr, (en_nonprogr.voice != 'active'))
col_dict['nonprogr_voice_active'] = nonprogr_voice_active

nonprogr_voice_passive = calculator(en_nonprogr, (en_nonprogr.voice != 'passive'))
col_dict['nonprogr_voice_passive'] = nonprogr_voice_passive



In [24]:
# animacy
both_anim_animate = calculator(en, (en.anim_binary != 'animate'))
col_dict['both_anim_animate'] = both_anim_animate

both_anim_inanimate = calculator(en, (en.anim_binary != 'inanimate'))
col_dict['both_anim_inanimate'] = both_anim_inanimate

progr_anim_animate = calculator(en_progr, (en_progr.anim_binary != 'animate'))
col_dict['progr_anim_animate'] = progr_anim_animate

progr_anim_inanimate = calculator(en_progr, (en_progr.anim_binary != 'inanimate'))
col_dict['progr_anim_inanimate'] = progr_anim_inanimate

nonprogr_anim_animate = calculator(en_nonprogr, (en_nonprogr.anim_binary != 'animate'))
col_dict['nonprogr_anim_animate'] = nonprogr_anim_animate

nonprogr_anim_inanimate = calculator(en_nonprogr, (en_nonprogr.anim_binary != 'inanimate'))
col_dict['nonprogr_anim_inanimate'] = nonprogr_anim_inanimate


In [25]:
# predictiveness 
both_pred_pred = calculator(en, (en.predictiveness_binary != 'predict.x'))
col_dict['both_pred_pred'] = both_pred_pred

both_pred_notPred = calculator(en, (en.predictiveness_binary != 'not_predict'))
col_dict['both_pred_notPred'] = both_pred_notPred



progr_pred_pred = calculator(en_progr, (en_progr.predictiveness_binary != 'predict.x'))
col_dict['progr_pred_pred'] = progr_pred_pred

progr_pred_notPred = calculator(en_progr, (en_progr.predictiveness_binary != 'not_predict'))
col_dict['progr_pred_notPred'] = progr_pred_notPred


nonprogr_pred_pred = calculator(en_nonprogr, (en_nonprogr.predictiveness_binary != 'predict.x'))
col_dict['nonprogr_pred_pred'] = nonprogr_pred_pred

nonprogr_pred_notPred = calculator(en_nonprogr, (en_nonprogr.predictiveness_binary != 'not_predict'))
col_dict['nonprogr_pred_notPred'] = nonprogr_pred_notPred


In [26]:
# intentionality 

both_int_int = calculator(en, (en.intentionality != 'int'))
col_dict['both_int_int'] = both_int_int
both_int_nint = calculator(en, (en.intentionality != 'nint'))
col_dict['both_int_nint'] = both_int_nint

progr_int_int = calculator(en_progr, (en_progr.intentionality != 'int'))
col_dict['progr_int_int'] = progr_int_int
progr_int_nint = calculator(en_progr, (en_progr.intentionality != 'nint'))
col_dict['progr_int_nint'] = progr_int_nint

nonprogr_int_int = calculator(en_nonprogr, (en_nonprogr.intentionality != 'int'))
col_dict['nonprogr_int_int'] = nonprogr_int_int
nonprogr_int_nint = calculator(en_nonprogr, (en_nonprogr.intentionality != 'nint'))
col_dict['nonprogr_int_nint'] = nonprogr_int_nint
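
The cells above repeat one pattern for every combination of subset, parameter and value. As a hedged sketch (not the code used for the thesis), the same `col_dict` keys could be generated in a loop. The `specs` list, the toy rows and the use of a positive filter with `size()` are assumptions made for illustration; only two parameters are shown.

```python
import pandas as pd

# Hypothetical annotated observations.
toy = pd.DataFrame({
    'authorId_textId': ['11_1', '11_1', '8_1'],
    'aspect': ['progressive', 'nonProgressive', 'progressive'],
    'motion': ['yes', 'no', 'yes'],
    'voice': ['active', 'active', 'passive'],
})

subsets = {
    'both': toy,
    'progr': toy[toy.aspect == 'progressive'],
    'nonprogr': toy[toy.aspect == 'nonProgressive'],
}

# (short name used in the key, column, value to count, label used in the key)
specs = [
    ('motion', 'motion', 'yes', 'yes'),
    ('motion', 'motion', 'no', 'no'),
    ('voice', 'voice', 'active', 'active'),
    ('voice', 'voice', 'passive', 'passive'),
]

col_dict = {}
for subset_name, subset in subsets.items():
    for short, column, value, label in specs:
        kept = subset[subset[column] == value]  # keep only the matching rows
        key = '{}_{}_{}'.format(subset_name, short, label)
        col_dict[key] = kept.groupby('authorId_textId').size().to_dict()

print(sorted(col_dict))
```

The separate `label` field is there because some keys in the notebook use a shorter label than the annotated value (e.g. `pred_notPred` for `'not_predict'`).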

In [27]:
for dict_name, dictionary in col_dict.items():
    # print(dict_name)
    new_df_en[dict_name] = new_df_en['authorId_textId'].map(dictionary)

In [28]:
new_df_en.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 52 entries, 19 to 1489
Data columns (total 44 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   author                       52 non-null     object 
 1   author_id                    52 non-null     int64  
 2   text_id_per_author           52 non-null     int64  
 3   authorId_textId              52 non-null     object 
 4   textDate                     52 non-null     int64  
 5   timeframe                    52 non-null     object 
 6   kind                         52 non-null     object 
 7   USTC_subject_classification  52 non-null     object 
 8   period                       52 non-null     object 
 9   period_category              52 non-null     int64  
 10  wordcount                    52 non-null     int64  
 11  both                         52 non-null     int64  
 12  progr                        52 non-null     float64
 13  nonprogr           

In [29]:
def calculate_relfreq(x, y, fixedNr=1000000):
    """
    Function to calculate relative frequency.
    
    Takes two columns as arguments, as well as a fixed number to multiply by, which defaults to 1,000,000.
    
    
    Returns: relative frequency, calculated by dividing the raw number of occurrences by the total 
    wordcount and multiplied with the fixed number.
    
    """
    
    return (x/y)*fixedNr
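
A quick usage sketch of `calculate_relfreq` with plain numbers. The counts are invented; the function body is the one defined above. Because the function only uses division and multiplication, it works in the same way on whole pandas columns.

```python
def calculate_relfreq(x, y, fixedNr=1000000):
    """Relative frequency: raw count divided by word count, times a fixed number."""
    return (x / y) * fixedNr

# Hypothetical text: 3 observations in a text of 60,000 words.
per_million = calculate_relfreq(3, 60_000)

# The same count expressed per 10,000 words by overriding the default.
per_ten_thousand = calculate_relfreq(3, 60_000, fixedNr=10_000)

print(per_million, per_ten_thousand)
```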

In [30]:
# for each column containing raw frequencies, insert a new column with normalized frequencies of that raw number 

for column in new_df_en.iloc[:,11:44].columns:
    new_df_en['norm_{}'.format(column)] = calculate_relfreq(new_df_en[column], new_df_en['wordcount'], 1000000)
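
The loop above selects the raw-count columns by position (`iloc[:, 11:44]`), which depends on the exact column order at that point in the notebook. A minimal sketch of the same normalization driven by an explicit list of column names follows; the list `raw_columns` and the toy values are hypothetical.

```python
import pandas as pd

def calculate_relfreq(x, y, fixedNr=1000000):
    return (x / y) * fixedNr

# Hypothetical per-text table with two raw-count columns.
toy = pd.DataFrame({
    "authorId_textId": ["11_1", "8_1"],
    "wordcount": [100_000, 250_000],
    "both": [4, 5],
    "progr": [1, 0],
})

# Name the raw-count columns explicitly instead of slicing by position.
raw_columns = ["both", "progr"]
for column in raw_columns:
    toy["norm_{}".format(column)] = calculate_relfreq(toy[column], toy["wordcount"])

print(toy)
```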

In [31]:
new_df_en = new_df_en.fillna(0)

In [32]:
# insert binary column with 0 if early and 1 if later
new_df_en.insert(6, 'timeframe_binary', [0 if value == 'early' else 1 for value in new_df_en.timeframe])

In [33]:
# insert binary column with 0 if reference and 1 if translation
new_df_en.insert(8, 'kind_binary', [0 if value == 'reference' else 1 for value in new_df_en.kind])
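
The binary coding above maps the reference level to 0 and every other value to 1. A vectorized sketch that should be equivalent, shown on invented values:

```python
import pandas as pd

# Hypothetical timeframe labels.
toy = pd.DataFrame({"timeframe": ["early", "later", "early"]})

# Notebook style: list comprehension over the column.
by_comprehension = [0 if value == "early" else 1 for value in toy.timeframe]

# Vectorized alternative: "not equal to 'early'" as booleans, cast to 0/1.
vectorized = toy["timeframe"].ne("early").astype(int)

print(by_comprehension, vectorized.tolist())
```

In both versions any value other than the reference level, including an unexpected or missing label, is coded as 1, so it may be worth checking the unique values first.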

In [34]:
#new_df_en.to_excel('/Users/paulineclaes/Documents/dta/thesis/finaldata/en_textStats.xlsx',
#                  index=False,
#                  na_rep=0)

### Doing the same for French data

In [17]:
import pandas as pd

In [18]:
temp_fr = pd.read_excel('/Users/paulineclaes/Documents/dta/thesis/ClaesPauline_thesis_finaleversie/data/final_AllerINF.xlsx')

In [19]:
temp_fr.head()

Unnamed: 0,timeframe,period_category,kind,data_identifier,author_id,text_id_per_author,authorId_textId,fr_source_filename,fr_source_title,fr_source_author,...,vpers_grammaticalPerson,pos,previous50,prev1,aller,aller_POS,INF,next1,next1_POS,next50
0,early,2,french_original,EFS01,35,1,35_1,WFtemp_LABE_Debat_de_folie_et_amour,Débat de folie et d'amour,"Labé, Louise",...,3,inf,"à gouverner les Viles , sans que lon l' apelle...",d',aller,VINF,planter,des,P+D,"chous . Le fol ira tant et viendra , en donner..."
1,early,2,french_original,EFS01,35,1,35_1,WFtemp_LABE_Debat_de_folie_et_amour,Débat de folie et d'amour,"Labé, Louise",...,3,pres,ay dit . Quand Mercure ut fini la defense de F...,",",và,VP,prononcer,un,DET,arrest interlocutoire en cette maniere : Pour ...
2,early,3,french_original,EFS13,47,1,47_1,LESCARBOT_cleaned,"Histoire de la Nouvelle France (Lescarbot, Marc)","Lescarbot, Marc",...,3,past,de Canada & Hochelaga au temps de post Tacques...,est,allé,VPP,rechercher,leurs,DET,"pelleteries , Canada que pour icelles ils ont ..."
3,early,3,french_original,EFS13,47,1,47_1,LESCARBOT_cleaned,"Histoire de la Nouvelle France (Lescarbot, Marc)","Lescarbot, Marc",...,3,pres,qu' vn autre & n' en perdent point yn tour de ...,les,vont,VP,voir,de,ADP,plus grand chose : comme pardeça quand on pres...
4,early,3,french_original,EFS13,47,1,47_1,LESCARBOT_cleaned,"Histoire de la Nouvelle France (Lescarbot, Marc)","Lescarbot, Marc",...,un,pres,à l' vn des bours dudit lac ne nous apparoisso...,&,aller,VINF,chercher,passage,NOUN,tre ou cinq rivieres toutes sortantes dudit ff...


In [20]:
temp_fr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1363 entries, 0 to 1362
Data columns (total 41 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   timeframe                  1363 non-null   object
 1   period_category            1363 non-null   int64 
 2   kind                       1363 non-null   object
 3   data_identifier            1363 non-null   object
 4   author_id                  1363 non-null   int64 
 5   text_id_per_author         1363 non-null   int64 
 6   authorId_textId            1363 non-null   object
 7   fr_source_filename         1363 non-null   object
 8   fr_source_title            1363 non-null   object
 9   fr_source_author           1363 non-null   object
 10  fr_source_textDate         1363 non-null   int64 
 11  eebo_genre                 1363 non-null   object
 12  fr_source_all_tokens       1363 non-null   int64 
 13  en_translation_filename    1363 non-null   object
 14  en_trans

In [21]:
len(temp_fr[temp_fr['anim_simplified']=='dummy'])

12

In [22]:
temp_fr.voice.unique()

array(['active', 'noise', 'passive'], dtype=object)

In [23]:
# remove noise
fr = temp_fr.drop(temp_fr[(temp_fr.voice == 'noise')].index)

In [24]:
fr.voice.unique()

array(['active', 'passive'], dtype=object)

In [25]:
# constructing a dictionary with unique authorId_textIds and their number of concordances 
rows_dict = {key:len(value) for key, value in fr.groupby('authorId_textId').groups.items()}

In [26]:
# select some columns
temp_new_df_fr = fr[['fr_source_author','author_id', 'text_id_per_author', 'authorId_textId', 'fr_source_textDate','timeframe', 'kind', 'eebo_genre', 'period_category', 'fr_source_all_tokens']]


In [27]:
# keeping only unique authorId_textIds

new_df_fr = temp_new_df_fr.drop_duplicates(subset=['authorId_textId'])

In [28]:
len(new_df_fr) == len(new_df_fr.authorId_textId.unique())

True

In [29]:
new_df_fr.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37 entries, 0 to 1289
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   fr_source_author      37 non-null     object
 1   author_id             37 non-null     int64 
 2   text_id_per_author    37 non-null     int64 
 3   authorId_textId       37 non-null     object
 4   fr_source_textDate    37 non-null     int64 
 5   timeframe             37 non-null     object
 6   kind                  37 non-null     object
 7   eebo_genre            37 non-null     object
 8   period_category       37 non-null     int64 
 9   fr_source_all_tokens  37 non-null     int64 
dtypes: int64(5), object(5)
memory usage: 3.2+ KB


In [30]:
def calculate_relfreq(x, y, fixedNr=1000000):
    """
    Function to calculate relative frequency.
    
    Takes two columns as arguments, as well as a fixed number to multiply by, which defaults to 1,000,000.
    
    
    Returns: relative frequency, calculated by dividing the raw number of occurrences by the total 
    wordcount and multiplied with the fixed number.
    
    """
    
    return (x/y)*fixedNr

In [31]:
# insert column with raw number of go-constructions in French texts
new_df_fr.insert(10, 'raw_n', new_df_fr['authorId_textId'].map(rows_dict))

In [32]:
# insert column with normalized frequency of go-constructions in French texts
new_df_fr.insert(11, 'relfreq', calculate_relfreq(new_df_fr.raw_n, new_df_fr.fr_source_all_tokens, 1000000))

In [33]:
new_df_fr.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37 entries, 0 to 1289
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fr_source_author      37 non-null     object 
 1   author_id             37 non-null     int64  
 2   text_id_per_author    37 non-null     int64  
 3   authorId_textId       37 non-null     object 
 4   fr_source_textDate    37 non-null     int64  
 5   timeframe             37 non-null     object 
 6   kind                  37 non-null     object 
 7   eebo_genre            37 non-null     object 
 8   period_category       37 non-null     int64  
 9   fr_source_all_tokens  37 non-null     int64  
 10  raw_n                 37 non-null     int64  
 11  relfreq               37 non-null     float64
dtypes: float64(1), int64(6), object(5)
memory usage: 3.8+ KB


In [34]:
new_df_fr.timeframe.unique()

array(['early', 'later'], dtype=object)

In [35]:
# insert binary column with 0 if early and 1 if later
new_df_fr.insert(6, 'timeframe_binary', [0 if value=='early' else 1 for value in new_df_fr.timeframe])

In [51]:
#new_df_fr.to_excel('/Users/paulineclaes/Documents/dta/thesis/finaldata/fr_textStats_new.xlsx',
#                  index=False,
#                  na_rep=0)

In [36]:
def calculator(df, condition, grouping_object='authorId_textId'):
    '''Function to calculate raw number of observations per unique text.'''
    # Drop all rows based on a negative condition: to get the raw count of
    # go-constructions where motion is used, we specify the condition that
    # drops all rows where motion is NOT 'yes'.
    new_df = df.drop(df[condition].index)
    rows_dict = {key:len(value) for key, value in new_df.groupby(grouping_object).groups.items()}
    return rows_dict

In [37]:
# instantiate empty dictionary to which all counts will be added 
col_dict = {}

In [38]:
motion_yes = calculator(fr, (fr.motion != 'yes'))
col_dict['motion_yes'] = motion_yes
motion_no = calculator(fr, (fr.motion != 'no'))
col_dict['motion_no'] = motion_no

voice_active = calculator(fr, (fr.voice != 'active'))
col_dict['voice_active'] = voice_active
voice_passive = calculator(fr, (fr.voice != 'passive'))
col_dict['voice_passive'] = voice_passive

anim_animate = calculator(fr, (fr.anim_simplified != 'animate'))
col_dict['anim_animate'] = anim_animate
anim_dummy = calculator(fr, (fr.anim_simplified != 'dummy'))
col_dict['anim_dummy'] = anim_dummy
anim_inanimate = calculator(fr, (fr.anim_simplified != 'inanimate'))
col_dict['anim_inanimate'] = anim_inanimate

pred_pred = calculator(fr, (fr.predictiveness_binary != 'predict.x'))
col_dict['pred_pred'] = pred_pred
pred_notPred = calculator(fr, (fr.predictiveness_binary != 'not_predict'))
col_dict['pred_notPred'] = pred_notPred

int_int = calculator(fr, (fr.intentionality != 'int'))
col_dict['int_int'] = int_int
int_nint = calculator(fr, (fr.intentionality != 'nint'))
col_dict['int_nint'] = int_nint

In [44]:
# check to see if it worked
print(list(col_dict.items())[1])

('motion_no', {'35_1': 1, '36_1': 70, '37_1': 3, '38_1': 4, '39_1': 15, '39_2': 15, '40_1': 59, '42_1': 21, '43_1': 11, '44_1': 27, '45_1': 34, '46_1': 4, '47_1': 17, '48_1': 7, '49_1': 1, '50_1': 3, '51_1': 3, '53_1': 15, '54_1': 4, '71_1': 9, '71_3': 2, '72_1': 26, '73_1': 14, '74_1': 3, '75_1': 18, '76_1': 11, '76_2': 25, '77_1': 2, '78_1': 3, '79_1': 17, '80_1': 9, '81_1': 25, '82_1': 3, '82_2': 23, '84_1': 5, '85_1': 3})


> Result: a dictionary mapping each authorId_textId to its raw count. Texts without a matching observation are absent from the dictionary.
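
As a cross-check on dictionaries like the one printed above, `pd.crosstab` can tabulate all values of a parameter per text in one call. This is a sketch on invented rows, not the thesis data; unlike the dictionaries, the table contains an explicit 0 for texts that lack a given value.

```python
import pandas as pd

# Hypothetical observations for two texts.
toy = pd.DataFrame({
    "authorId_textId": ["35_1", "35_1", "36_1", "36_1", "36_1"],
    "motion": ["yes", "no", "no", "no", "no"],
})

# Rows: texts; columns: values of 'motion'; cells: raw counts.
counts = pd.crosstab(toy["authorId_textId"], toy["motion"])

motion_no = counts["no"].to_dict()
motion_yes = counts["yes"].to_dict()

print(counts)
```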

In [45]:
for dict_name, dictionary in col_dict.items():
    # print(dict_name)
    new_df_fr[dict_name] = new_df_fr['authorId_textId'].map(dictionary)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df_fr[dict_name] = new_df_fr['authorId_textId'].map(dictionary)
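
The message above is pandas' `SettingWithCopyWarning`: the per-text dataframe was derived from a selection of another dataframe, so pandas cannot tell whether the assignment is meant to affect the original. One common way to avoid the warning, sketched here on invented rows, is to take an explicit `.copy()` when the per-text table is created. The names `fr_toy` and `per_text` are hypothetical.

```python
import pandas as pd

# Hypothetical concordance rows with one word count per text.
fr_toy = pd.DataFrame({
    "authorId_textId": ["35_1", "35_1", "47_1"],
    "tokens": [28_330, 28_330, 276_000],
})

# Select columns, keep one row per text, and make the result independent.
per_text = (
    fr_toy[["authorId_textId", "tokens"]]
    .drop_duplicates(subset=["authorId_textId"])
    .copy()
)

# Adding a column to the explicit copy leaves the source dataframe untouched.
per_text["raw_n"] = per_text["authorId_textId"].map({"35_1": 2, "47_1": 1})

print(per_text)
```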


In [46]:
new_df_fr.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37 entries, 0 to 1289
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fr_source_author      37 non-null     object 
 1   author_id             37 non-null     int64  
 2   text_id_per_author    37 non-null     int64  
 3   authorId_textId       37 non-null     object 
 4   fr_source_textDate    37 non-null     int64  
 5   timeframe             37 non-null     object 
 6   timeframe_binary      37 non-null     int64  
 7   kind                  37 non-null     object 
 8   eebo_genre            37 non-null     object 
 9   period_category       37 non-null     int64  
 10  fr_source_all_tokens  37 non-null     int64  
 11  raw_n                 37 non-null     int64  
 12  relfreq               37 non-null     float64
 13  motion_yes            34 non-null     float64
 14  motion_no             36 non-null     float64
 15  voice_active          3

In [47]:
new_df_fr.rename(columns={'fr_source_all_tokens':'wordcount'}, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().rename(


In [48]:
new_df_fr.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37 entries, 0 to 1289
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   fr_source_author    37 non-null     object 
 1   author_id           37 non-null     int64  
 2   text_id_per_author  37 non-null     int64  
 3   authorId_textId     37 non-null     object 
 4   fr_source_textDate  37 non-null     int64  
 5   timeframe           37 non-null     object 
 6   timeframe_binary    37 non-null     int64  
 7   kind                37 non-null     object 
 8   eebo_genre          37 non-null     object 
 9   period_category     37 non-null     int64  
 10  wordcount           37 non-null     int64  
 11  raw_n               37 non-null     int64  
 12  relfreq             37 non-null     float64
 13  motion_yes          34 non-null     float64
 14  motion_no           36 non-null     float64
 15  voice_active        37 non-null     int64  
 16  voice_pa

In [49]:
def calculate_relfreq(x, y, fixedNr=1000000):
    """
    Function to calculate relative frequency.
    
    Takes two columns as arguments, as well as a fixed number to multiply by, which defaults to 1,000,000.
    
    
    Returns: relative frequency, calculated by dividing the raw number of occurrences by the total 
    wordcount and multiplied with the fixed number.
    
    """
    
    return (x/y)*fixedNr

In [50]:
for column in new_df_fr.iloc[:,13:24].columns:
    new_df_fr['norm_{}'.format(column)] = calculate_relfreq(new_df_fr[column], new_df_fr['wordcount'], 1000000)
    
    

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df_fr['norm_{}'.format(column)] = calculate_relfreq(new_df_fr[column], new_df_fr['wordcount'], 1000000)


In [51]:
new_df_fr.head()

Unnamed: 0,fr_source_author,author_id,text_id_per_author,authorId_textId,fr_source_textDate,timeframe,timeframe_binary,kind,eebo_genre,period_category,...,norm_motion_no,norm_voice_active,norm_voice_passive,norm_anim_animate,norm_anim_dummy,norm_anim_inanimate,norm_pred_pred,norm_pred_notPred,norm_int_int,norm_int_nint
0,"Labé, Louise",35,1,35_1,1555,early,0,french_original,literature,2,...,35.29827,70.596541,,35.29827,35.29827,,,70.596541,35.29827,
2,"Lescarbot, Marc",47,1,47_1,1609,early,0,french_original,history,3,...,61.593757,326.084594,,318.838269,,7.246324,3.623162,322.461431,307.968783,18.115811
92,Goulart,43,1,43_1,1604,early,0,french_original,history,3,...,50.410155,233.719811,,229.13707,,4.582741,4.582741,229.13707,192.475139,41.244673
143,"Mornay, Philippe de",48,1,48_1,1598,early,0,french_original,religious,3,...,18.857403,32.326977,,29.633062,,2.693915,8.081744,24.245232,16.163488,5.387829
149,Lanoue,44,1,44_1,1587,early,0,french_original,history,3,...,107.844273,263.619334,,251.636637,3.994232,,3.994232,259.625101,235.659707,7.988465


In [52]:
new_df_fr = new_df_fr.fillna(0)

In [64]:
#new_df_fr.to_excel('/Users/paulineclaes/Documents/dta/thesis/finaldata/fr_textStats.xlsx',
#                  index=False,
#                  na_rep=0)