# README

This notebook describes the functions that fix the data before it is processed by the algorithms:
1. Store all the relevant metadata under predefined headers, in addition to the protocol data itself
2. Fix speaker names so that they are insensitive to typos
3. Add a PersonID to each speaker so that each speech segment can be related to other files

Input:
1. Metadata from the file 'data_committees_kns_committeesession_kns_committeesession.csv' - we found that this table already stores the committee name for each session along with its ID
2. All files in the meeting_protocols_parts folder
3. The link between a person's name and their entity ID, from 'data_members_kns_person_kns_person.csv'

Output:
1. A fixed version of the parsed protocols, with a subject ID (the PersonID column)
2. Additional data for the metadata file: a new column with the speaker IDs found in each session

# Configuration

In [None]:
import numpy as np
import sys, os
import pandas as pd
import re
import nltk
import warnings
import pickle

I used this layout to locate the data.
<b>You will probably need to replace it with the layout created by the pipeline.</b>

In [1]:
PROJECT_ROOT = r'C:\Users\lotem\knesset_data'
META_SOURCE = os.path.join(PROJECT_ROOT, 'data_committees_kns_committeesession_kns_committeesession.csv')
COMMITTEESS_ROOT = os.path.join(PROJECT_ROOT, 'meeting_protocols_parts')
PEOPLE_SOURCE = os.path.join(PROJECT_ROOT, 'data_members_kns_person_kns_person.csv')
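
The paths above are tied to one machine. A minimal sketch of making the root configurable follows; the environment variable name `KNESSET_DATA_ROOT` and the helper `resolve_paths` are assumptions made for this example and are not part of the original notebook or the pipeline.

```python
import os

# KNESSET_DATA_ROOT is a hypothetical variable name chosen for this sketch.
DEFAULT_ROOT = r'C:\Users\lotem\knesset_data'


def resolve_paths(root=None):
    """Return the three input locations under the given (or configured) root."""
    root = root or os.environ.get('KNESSET_DATA_ROOT', DEFAULT_ROOT)
    return {
        'meta': os.path.join(root, 'data_committees_kns_committeesession_kns_committeesession.csv'),
        'protocols': os.path.join(root, 'meeting_protocols_parts'),
        'people': os.path.join(root, 'data_members_kns_person_kns_person.csv'),
    }


paths = resolve_paths()
```

An explicit `root` argument takes precedence, then the environment variable, then the hard-coded default.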

# Get meta

In [2]:
meta = pd.read_csv(META_SOURCE)
meta = meta[~meta['parts_parsed_filename'].isnull()]
meta.head(2)

Unnamed: 0,CommitteeSessionID,Number,KnessetNum,TypeID,TypeDesc,CommitteeID,Location,SessionUrl,BroadcastUrl,StartDate,...,download_filename,download_filesize,parts_crc32c,parts_filesize,parts_parsed_filename,text_crc32c,text_filesize,text_parsed_filename,topics,committee_name
9,64550,4.0,16,161,פתוחה,24,"חדר הוועדה, באגף הוועדות (קדמה), קומה 1, חדר 1820",http://main.knesset.gov.il/Activity/committees...,,2003-03-17 09:30:00,...,files/23/3/6/367838.DOC,64814.0,4MXPig==,96793.0,files/6/4/64550.csv,lSu4Ug==,97645.0,files/6/4/64550.txt,"[""שדה התעופה בהרצליה - כפר שמריהו""]",הפנים ואיכות הסביבה
10,64551,2.0,16,161,פתוחה,24,"חדר הוועדה, באגף הוועדות (קדמה), קומה 1, חדר 1820",http://main.knesset.gov.il/Activity/committees...,,2003-03-11 10:30:00,...,files/23/3/6/367839.DOC,68247.0,o5iVPg==,115462.0,files/6/4/64551.csv,eHpjug==,116107.0,files/6/4/64551.txt,"[""הזרמת מי שופכין לים תיכון ""]",הפנים ואיכות הסביבה


Load the persons data

In [3]:
ppldf = pd.read_csv(PEOPLE_SOURCE)
ppldf.head()

Unnamed: 0,PersonID,LastName,FirstName,GenderID,GenderDesc,Email,IsCurrent,LastUpdatedDate
0,30299,נתונים,אין,251,זכר,,True,2000-01-01 00:00:00
1,1026,אברהם-בלילא,רוחמה,250,נקבה,,False,2015-03-20 12:03:08
2,1029,רצון,מיכאל,251,זכר,,False,2015-03-20 12:03:08
3,1030,והבה,מגלי,251,זכר,,False,2015-03-20 12:03:08
4,1031,אדרי,יעקב,251,זכר,,False,2015-03-20 12:03:08
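
The link between a speaker name and a PersonID amounts to a lookup keyed on "FirstName LastName". The sketch below shows that idea with a plain dictionary; `build_name_index` is a hypothetical helper, and the two sample rows are copied from the table above.

```python
def build_name_index(people):
    """Map 'FirstName LastName' to PersonID; the first occurrence of a name wins."""
    index = {}
    for person in people:
        full_name = f"{person['FirstName']} {person['LastName']}"
        index.setdefault(full_name, person['PersonID'])
    return index


# Two rows copied from the table above.
people = [
    {'PersonID': 1029, 'LastName': 'רצון', 'FirstName': 'מיכאל'},
    {'PersonID': 1031, 'LastName': 'אדרי', 'FirstName': 'יעקב'},
]
name_to_id = build_name_index(people)
print(name_to_id.get('מיכאל רצון'))    # 1029
print(name_to_id.get('unknown name'))  # None
```

With pandas, `ppldf.to_dict('records')` yields rows in the shape this helper expects.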


# Define preprocessing steps

1. get_full_url - build the full path of a file from its relative path in the metadata file
2. remove_special_character - take a string and remove special characters (it currently runs on all the data; it may be better to run it only on the header and let later phases make their own corrections to the actual text)
3. remove_person_title - for now, this just removes the "היור" prefix (the chairperson title, after punctuation removal) from the speaker name, leaving only the actual name
4. remove_calls - remove the rows whose speaker name is "קריאה" (an interjection); if a remark was not documented with a speaker, it will probably just disturb the algorithms
5. fix_header - embed the relevant metadata in header cells that start with '_', so that these files contain all the data needed for the NLP analysis
6. fix_speakers_names - map speaker names that are within a small edit distance of each other to a single spelling
7. add_speaker_id_to_comittee_session - add a PersonID column by matching speaker names against the persons table
8. load_and_preprocess - the function that actually calls all the preprocessing steps on one row of the metadata file

In [4]:
def get_full_url(metadf, root_folder):
    if type(metadf) is pd.DataFrame:
        return [os.path.join(root_folder,'meeting_protocols_parts', x[1]['parts_parsed_filename']) 
            for x in metadf.iterrows()]
    else:
        return os.path.join(root_folder,'meeting_protocols_parts', metadf['parts_parsed_filename']) 
    

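def remove_special_character(strg):
    # Raw strings avoid invalid-escape warnings for the regex patterns.
    CHARS_TO_FILTER = r'\.|,|"|\(|\)|;|:|\?'
    CHARS_TO_WHITE = r'\t|\n'
    if isinstance(strg, str):
        strg = re.sub(CHARS_TO_FILTER, '', strg)
        strg = re.sub(CHARS_TO_WHITE, ' ', strg)
    else:
        # Non-string cells (e.g. NaN) become empty strings.
        strg = ''
    return strg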

def remove_person_title(strng):
    constName = "היור "
    if strng.startswith(constName):
        strng = strng[len(constName):]
    return strng

def remove_calls(df, **kwargs):
    df = df.drop(df[df['header']=='קריאה'].index, **kwargs)
    return df
    
def fix_header(df, meta):
    df.drop(index=0, inplace=True)
    # Use .loc to avoid chained assignment, which may write to a copy.
    if df.loc[1, 'header'] == 'נכחו':
        df.loc[1, 'header'] = '_original_attendens_cell'
    
    others = {'header': ['_topics', '_committee_name', '_number', '_KnessetNum'],
        'body': [meta['topics'],meta['committee_name'], meta['Number'], meta['KnessetNum']]}
    
    others = pd.DataFrame.from_dict(others)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.
    df = pd.concat([df, others], sort=False)
    df.reset_index(inplace=True, drop=True)
    return df

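def fix_speakers_names(df):
    # Speaker names are all headers that are not metadata cells (those start with '_').
    all_names = df['header'][~df['header'].str.startswith('_')].unique()
    base_names = []
    names_dict = {}
    for nm in all_names:
        dists = np.array([nltk.edit_distance(nm, x) for x in base_names])
        if np.all(dists>3):
            # No known name within edit distance 3: keep it as a new base name.
            base_names.append(nm)
            names_dict[nm] = nm
        else:
            # Otherwise map the name to its closest base name.
            names_dict[nm] = base_names[np.argmin(dists)]
    # Assign the result back; apply() does not modify the column in place.
    df['header'] = df['header'].apply(lambda x: names_dict.get(x,x))
    speakers = pd.DataFrame.from_dict({'header':'_speakers', 'body':all_names})
    df = pd.concat([df, speakers], sort=False)
    return df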

def add_speaker_id_to_comittee_session(df, ppldf):
    ppldf['conc'] = ppldf['FirstName']+' ' + ppldf['LastName']
    df_speakers = df['body'][df['header']=='_speakers'].values
    speaker2id = {}
    for speaker in df_speakers:
        if np.any(ppldf['conc'].values==speaker):
            speaker2id[speaker] = ppldf['PersonID'][ppldf['conc'].values==speaker].values[0]
#         else:
#             dists = np.array([nltk.edit_distance(speaker, x) for x in ppldf['conc']])
#             amin = np.argmin(dists).astype(int)
#             if dists[amin]<3:
#                 speaker2id[speaker] = ppldf.iloc[amin]['PersonID']
    df['PersonID'] = df['header'].apply(lambda x: speaker2id.get(x,''))
    return df
    
def load_and_preprocess(meta_row, root_folder, peopleDF):
    filepath = get_full_url(meta_row,root_folder)
    fileContent = pd.read_csv(filepath)
    fileContent = fileContent.applymap(remove_special_character).applymap(remove_person_title)
    fileContent = remove_calls(fileContent)
    fileContent = fix_header(fileContent, meta_row)
    fileContent = fix_speakers_names(fileContent)
    fileContent = add_speaker_id_to_comittee_session(fileContent, peopleDF)
    return fileContent
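
The typo handling in fix_speakers_names is a greedy grouping: each name is compared with the base names seen so far and is mapped to the closest one if the edit distance is at most 3. The sketch below reproduces that logic without dependencies. The pure-Python `edit_distance` is a stand-in for `nltk.edit_distance`, `group_names` is a hypothetical helper, and the sample names are made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def group_names(names, max_dist=3):
    """Greedily map each name to the closest earlier base name within max_dist."""
    base_names, mapping = [], {}
    for name in names:
        dists = [edit_distance(name, b) for b in base_names]
        if not dists or min(dists) > max_dist:
            base_names.append(name)
            mapping[name] = name
        else:
            mapping[name] = base_names[dists.index(min(dists))]
    return mapping


# Made-up names, for illustration only.
mapping = group_names(['Jonathan Smith', 'Jonathon Smith', 'Maria Garcia'])
print(mapping)
```

Because the threshold is absolute, two different short names that differ in three characters or fewer would also be merged, and the result depends on the order in which names first appear.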

<u>Iterate over all the rows in the metadata file and process them</u> <br/>
I created a new folder with the edited protocols and dumped to the file 'speakers.pkl' a dictionary that maps each
session to the speakers found in it <br/>
<u> Note </u> - I think there is still a bug: unmapped names are mapped inconsistently to np.nan/None/'', so this needs to be unified
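
One possible way to unify the unmapped values is to pass every PersonID through a single normalizing function. This is only a sketch: `normalize_person_id` is a hypothetical helper, and using `None` as the one "missing" marker is an assumption, not a decision taken in this notebook.

```python
import math


def normalize_person_id(value, missing=None):
    """Return an int PersonID, or `missing` for None, NaN, or blank strings."""
    if value is None:
        return missing
    if isinstance(value, float) and math.isnan(value):
        return missing
    if isinstance(value, str) and not value.strip():
        return missing
    return int(value)


raw_ids = [1029, '', None, float('nan'), '1031', 1030.0]
print([normalize_person_id(v) for v in raw_ids])
# [1029, None, None, None, 1031, 1030]
```

On a pandas column this could be applied with `df['PersonID'].map(normalize_person_id)`.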

In [5]:
new_parsed_folder = r'D:\knesset'
res = {}
for _, row in meta.iterrows():
    session_id = row['CommitteeSessionID']
    try:
        sys.stdout.write('.')
        tres = load_and_preprocess(row, root_folder=PROJECT_ROOT, peopleDF=ppldf)
        fulladdress = os.path.join(new_parsed_folder, 'speakers', row['parts_parsed_filename'])
        os.makedirs(os.path.dirname(fulladdress), exist_ok=True)
        tres.to_csv(fulladdress)
        res[session_id] = tres['PersonID'].unique()
    except Exception:
        # Record the failure and continue with the remaining sessions.
        res[session_id] = []
        warnings.warn(f'error in {session_id}')
print('finished')
# Save the mapping from session ID to the speaker IDs found in it.
with open(os.path.join(new_parsed_folder, 'speakers.pkl'), 'wb') as fh:
    pickle.dump(res, fh)
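
A later step can read the saved mapping back with `pickle.load`. The round trip below is a small sketch that writes to a temporary folder instead of `D:\knesset`; the session IDs come from the metadata table above, while the speaker IDs assigned to them are placeholders for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the `res` dictionary: session ID -> speaker IDs.
res = {64550: [1029, 1031], 64551: []}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'speakers.pkl')
    with open(path, 'wb') as fh:
        pickle.dump(res, fh)
    with open(path, 'rb') as fh:
        loaded = pickle.load(fh)

# Sessions that failed, or had no speakers, map to an empty collection.
empty_sessions = [sid for sid, ids in loaded.items() if len(ids) == 0]
print(empty_sessions)  # [64551]
```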





......................................



...........................



........................



...............................................................



....



..



...................



..........................................................................................................................



......



......................................



......



........



................................................



...........................................



...



...................................



.............



....



.......



...



.....



...............



...



.......................................................................................



...



.........



.......



.....



..............................................................................................................................



...



.................



.............................



.......



...



....................



..................................................



...................



...



.....



....................................................



.....................................................................................................



..........................................................



.......................



................



....................



.....................................



......



..



.....



......................................................



.................



.............................



.....................



........................



.............................................................................................



....



............................



.....................



.................



......................................



..........



...........................



......



..



...................................................



.................................................



.....



.............



........



....................



...............................



.......



.....



.....



.....................................................



..



..................



............



........



......



............



........................................................



...................



.......................



...



......................................



..................................



..........



.........................................



........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................



....................................................................................................................................................



..................................................................................................................



........................



.........................................................................................



.........................................................................................................................................................................................................................................................................................



...............................................................................................................



..............................................................................................................................



..................................



....................................................................



......................



.....



.........................................................



...................................



.........



..............................................................................................................................................................



.............



.............



.........................................................................................................................................................................



...........................................................................



.............



................................



.........................................



...........



................................................................................................................................................................



.................................



......................



.....



........................................................................



..............



.....................................................



........................



......................................



.................................................



.................



............



........................................



....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................finished
