# Latent Semantic Analysis using nouns and noun phrases from <code>ds_PWDB</code> dataset

Text data suffers heavily from high dimensionality. Latent Semantic Analysis (LSA) is a popular
dimensionality-reduction technique that applies Singular Value Decomposition (SVD) to a document-term matrix.
LSA ultimately reformulates text data in terms of r <b>latent</b> (i.e. <b>hidden</b>) features, where r is less than m,
the number of terms in the data. We will use the <code>ds_PWDB</code> textual data to fit the LSA model.
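
As a rough illustration of the idea, the sketch below runs a plain SVD on a tiny hand-made document-term matrix and keeps only r = 2 latent features. The matrix values and the variable names are invented for this example and are not taken from <code>ds_PWDB</code>.

```python
import numpy as np

# Toy document-term matrix: 4 documents (rows) x 5 terms (columns).
A = np.array([
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0, 1.0],
])

# Thin SVD: A = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 2  # number of latent features to keep (r < m = 5 terms)
documents_latent = U[:, :r] * s[:r]  # documents in latent space, shape (4, r)
terms_latent = Vt[:r, :].T           # terms in latent space, shape (5, r)

# Rank-r approximation of the original matrix.
A_r = documents_latent @ Vt[:r, :]
print(np.round(s, 3))
```

The discarded singular values measure how much of the original matrix is lost by keeping only r components.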

### Import the libraries

In [1]:
from typing import List
import warnings
warnings.filterwarnings('ignore')

import spacy
nlp = spacy.load("en_core_web_sm")
stop_words = nlp.Defaults.stop_words

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

from sem_covid.services.data_registry import Dataset
from sem_covid.entrypoints.notebooks.topic_modeling.topic_modeling_wrangling.topic_visualizer import \
    plotly_bar_chart_graphic
from sem_covid.services.sc_wrangling.data_cleaning import clean_text_from_specific_characters, clean_remove_stopwords
from sem_covid.entrypoints.notebooks.topic_modeling.topic_modeling_wrangling.tfidf_vectorization import vectorize_documents
from sem_covid.entrypoints.notebooks.topic_modeling.topic_modeling_wrangling.lsa_transformer import lsa_document_transformation


`scipy.sparse.sparsetools` is deprecated!
scipy.sparse.sparsetools is a private module for scipy.sparse, and should not be used.

Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.


### Download the dataset

In [2]:
pwdb = Dataset.PWDB.fetch()
pwdb.head()


`should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.

100% (1288 of 1288) |####################| Elapsed Time: 0:00:01 Time:  0:00:01


_id,identifier,title,title_national_language,country,start_date,end_date,date_type,type_of_measure,status_of_regulation,category,...,funding,involvement_of_social_partners_description,social_partner_involvement_form,social_partner_role,is_sector_specific,private_or_public_sector,is_occupation_specific,sectors,occupations,sources
adc5c75937bc7f7198f534d08b85bd50c9521bfd3f319a090932b5d0bae54de0,1297,Agreement on a teleworking regime,Convention du 20 octobre 2020 relative au régi...,Luxembourg,10/20/2020,,Open ended,Bipartite collective agreements,Entirely new measure,"Protection of workers, adaptation of workplace",...,[Companies],The agreement is a social partner initiative w...,,,No,Only private sector,No,[],[],[{'title': 'Accord signé entre partenaires soc...
2372d71eb9ad6e6a70982e02bbe802db004ed49d91b2264c0a2e8e41571002cc,864,Special protection for COVID-19 risk groups at...,Besonderer Schutz von Risikogruppen,Austria,05/06/2020,05/31/2021,Temporary,Legislations or other statutory regulations,Entirely new measure,"Protection of workers, adaptation of workplace",...,[National funds],The expert group which elaborated the definiti...,,,No,Not specified,No,[],[],"[{'title': 'FAQs Risk groups', 'url': 'https:/..."
8735e268191e9e5cbd3d2a44ca53d297e31746b5f1e24b941db6225a25848353,1228,Funds for innovative renewable projects in And...,Ayudas para proyectos de renovables en Andaluc...,Spain,08/25/2020,,Open ended,Legislations or other statutory regulations,Entirely new measure,"Promoting the economic, labour market and soci...",...,"[Employer, European Funds, Local funds, Nation...",No involvement reported,,,Yes,Not specified,No,"[Electricity, gas, steam and air conditioning ...",[],[{'title': 'LÍNEAS DE AYUDA PARA PROYECTOS REN...
18bcd22116c46919e03a3345f793c3859855227ac942e69dd13cbfcd588e1044,183,Waiver of advance payments for social and heal...,Odklad záloh sociálního a zdravotního pojištěn...,Czechia,03/01/2020,08/31/2020,Temporary,Legislations or other statutory regulations,Entirely new measure,Supporting businesses to stay afloat,...,[No special funding required],"Social partners, who are members of the tripar...",,,No,Not specified,No,[],[],[{'title': 'CSSZ: Self-employed persons - coro...
b94d8aa95fbdeb1bb832b01fbe5d6e9bf9fc36fceb14f7ba370a963f472fe35b,1550,Financial Shield 2.0: Small and medium-sized e...,Tarcza Finansowa 2.0: małe i średnie przedsięb...,Poland,01/01/2021,,Temporary,Legislations or other statutory regulations,New aspects included into existing measure,Supporting businesses to stay afloat,...,[National funds],None.,,,Yes,Only private sector,No,"[Manufacture of paper and paper products, Prin...",[],[{'title': 'The Act of 31 March 2020 amending ...


### Prepare the data
After importing the dataset and inspecting the table, we identify the columns that contain textual data and concatenate
them into a single text to work with.

In [3]:
# Cast every column to str so that a missing value in one column does not turn the whole concatenation into NaN
pwdb_descriptive_data = pwdb['title'].map(str) + ' ' + \
                        pwdb['background_info_description'].map(str) + ' ' + \
                        pwdb['content_of_measure_description'].map(str) + ' ' + \
                        pwdb['use_of_measure_description'].map(str) + ' ' + \
                        pwdb['involvement_of_social_partners_description'].map(str)

pwdb_descriptive_data





_id
adc5c75937bc7f7198f534d08b85bd50c9521bfd3f319a090932b5d0bae54de0    Agreement on a teleworking regime During the C...
2372d71eb9ad6e6a70982e02bbe802db004ed49d91b2264c0a2e8e41571002cc    Special protection for COVID-19 risk groups at...
8735e268191e9e5cbd3d2a44ca53d297e31746b5f1e24b941db6225a25848353    Funds for innovative renewable projects in And...
18bcd22116c46919e03a3345f793c3859855227ac942e69dd13cbfcd588e1044    Waiver of advance payments for social and heal...
b94d8aa95fbdeb1bb832b01fbe5d6e9bf9fc36fceb14f7ba370a963f472fe35b    Financial Shield 2.0: Small and medium-sized e...
                                                                                          ...                        
cb014a456b14c3621dd318a12e611f70c2a9636be9fe181072bd4bf5917a40fa    Automatic extension of unemployment benefits D...
d233b17dc2b98f14269c2b22be78d93ec5ccf2a0013b86f09175c69353c5800b    Extra subsidies to institutions in the cultura...
77d7e3c52aaf78bdfb1a1667641db1293bbff862440c547fc3f6

### Clean the data

After extracting the data, one of the most important steps is cleaning it. We delete the characters listed below
from our text to make it cleaner.

In [4]:
unused_characters = ["\\r", ">", "\n", "\\", "<", "''", "%", "...", "\'", '"', "("]
clean_text = clean_text_from_specific_characters(pwdb_descriptive_data, unused_characters)
clean_text





'[Agreement teleworking regime During COVID-19 crisis, teleworking identified vital pillar companies working prevent social hardship. Discussions representative social partners OGBL LCGB employer association UEL, Ministry Work Employment, led instance joint assessment teleworking level Economic Social Council CES. From there, discussions continued social partners inter-professional agreement signed social partners October 2020. Applied period years covering sectors Luxembourg with exception transport), agreement provides definition teleworking: * Teleworking identified form organisational work, conducted digital means usually company, transferred location employee lives. * The work considered teleworking applied occasional exceptional circumstances, remains 10 threshold annual working time. * Teleworking based written agreement employer employee, containing compulsory elements, example location telework takes place number hours, employee cant dismissed he/she accept teleworking scheme 
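
The cleaning step above relies on the project helper `clean_text_from_specific_characters`, whose implementation is not shown in this notebook. A minimal sketch of what such a step might look like is given below; the function `remove_characters` and the sample sentence are hypothetical and only illustrate the idea of stripping a list of character sequences.

```python
from typing import List


def remove_characters(text: str, characters: List[str]) -> str:
    """Replace every listed character sequence with a space, then collapse whitespace.

    Hypothetical stand-in for the project helper; the real one may behave differently.
    """
    for sequence in characters:
        text = text.replace(sequence, " ")
    return " ".join(text.split())


sample = 'The measure covers 10% of "working time"\n(annual)...'
cleaned = remove_characters(sample, ["\n", "%", "...", '"', "("])
print(cleaned)
```

Note that only the sequences in the list are removed: a closing parenthesis that is not listed stays in the text.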

### Extract nouns and noun phrases

In [5]:
doc = nlp(clean_text)

nouns = [token for token in doc if token.pos_ == "NOUN"]
str_nouns = [str(noun) for noun in nouns]
noun_phrases = [word.text for word in doc.noun_chunks]





### TFIDF Training

Our models work on numbers, not strings, so we tokenise the text (splitting every document into smaller observational
entities, in this case words) and then turn the tokens into numbers using Sklearn's TF-IDF vectoriser.
This gives us our vectorised text data: the document-term matrix.

We vectorise the noun phrases first and then the nouns; the singular values are visualised in the modelling section below.
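
The project helper `vectorize_documents` is not shown here, so the sketch below uses scikit-learn's `TfidfVectorizer` directly to illustrate how a list of short texts becomes a document-term matrix. The toy documents are invented for the example, and the default vectoriser settings are an assumption rather than the helper's actual configuration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy "documents" standing in for extracted noun phrases.
toy_documents = ["social partners", "risk group", "social security", "covid risk"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(toy_documents)  # sparse matrix, one row per document

# Column names in the order used by the matrix (works across scikit-learn versions).
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

document_term_matrix = pd.DataFrame(tfidf.toarray(), columns=terms)
print(document_term_matrix.round(3))
```

Each row is L2-normalised by default, and a term that occurs in fewer documents (such as "partners") receives a higher weight than one that occurs in several (such as "social").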

In [7]:
noun_phrases_vectors = vectorize_documents(noun_phrases)
noun_phrases_vectors





Unnamed: 0,19,agreement,covid,employer,group,measure,partners,representatives,risk,security,social,wage,work,workers,working
0,0.707107,0.0,0.707107,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
1,0.000000,0.0,0.000000,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
2,0.000000,0.0,0.000000,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,1.000000,0.0,0.0,0.0,0.0
3,0.000000,0.0,0.000000,0.0,0.0,0.0,0.76067,0.0,0.0,0.0,0.649139,0.0,0.0,0.0,0.0
4,0.000000,0.0,0.000000,1.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
287,0.000000,0.0,0.000000,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
288,0.000000,0.0,0.000000,0.0,0.0,0.0,0.76067,0.0,0.0,0.0,0.649139,0.0,0.0,0.0,0.0
289,0.000000,0.0,0.000000,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
290,0.000000,0.0,0.000000,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0


In [8]:
noun_vectors = vectorize_documents(str_nouns)
noun_vectors





Unnamed: 0,19,covid,risk,wage
0,0.000000,0.000000,0.0,0.0
1,0.000000,0.000000,0.0,0.0
2,0.707107,0.707107,0.0,0.0
3,0.000000,0.000000,0.0,0.0
4,0.000000,0.000000,0.0,0.0
...,...,...,...,...
526,0.000000,0.000000,0.0,0.0
527,0.000000,0.000000,0.0,0.0
528,0.000000,0.000000,0.0,0.0
529,0.000000,0.000000,0.0,1.0


### LSA Modeling

Let’s explore our reduced data through the term-topic matrix (the V factor of the SVD). The LSA transformation gives it
to us as a NumPy array with one row per term and one column per latent component, so we’ll turn it into a Pandas dataframe
for ease of manipulation. Before that, we plot the singular values, which show the relative importance of each latent component.

Let’s slice our term-topic matrix into Pandas Series (single-column data frames), sort them by value and plot them.
The code below does this for every latent component (recall that in Python we start counting from 0).
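
The helper `lsa_document_transformation` is project code that is not reproduced in this notebook. As a hedged sketch of the same steps, the example below applies scikit-learn's `TruncatedSVD` to a small invented TF-IDF-like matrix, reads the singular values, builds a term-topic dataframe and sorts one latent concept. The matrix values, the term list and the choice of three components are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD

terms = ["covid", "group", "partners", "risk", "social"]
# Toy document-term matrix: 6 documents (rows) x 5 terms (columns).
toy_matrix = np.array([
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.7, 0.0, 0.7, 0.0],
    [0.0, 0.0, 0.8, 0.0, 0.6],
    [0.7, 0.0, 0.0, 0.7, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.8, 0.0, 0.6],
])

num_components = 3
svd = TruncatedSVD(n_components=num_components, random_state=0)

document_topic = svd.fit_transform(toy_matrix)  # shape (n_documents, n_components)
singular_values = svd.singular_values_          # one value per latent component

# components_ has shape (n_components, n_terms); transposing gives one row per term.
term_topic = pd.DataFrame(
    svd.components_.T,
    index=terms,
    columns=[f"Latent_concept_{r}" for r in range(num_components)],
)

# Highest- and lowest-weighted terms of the first latent concept.
sorted_terms = term_topic["Latent_concept_0"].sort_values(ascending=False)
top_terms = sorted_terms.head(3)
bottom_terms = sorted_terms.tail(3)
print(top_terms)
```

The sign of a latent component is arbitrary, so which terms appear at the top or the bottom can flip between runs or library versions; the magnitudes are what carry the information.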

In [17]:
noun_phrases_lsa = lsa_document_transformation(noun_phrases_vectors, num_components=14)
noun_lsa = lsa_document_transformation(noun_vectors, num_components=3)





In [18]:
singular_values_noun_phrases_plot = plotly_bar_chart_graphic("Singular values", list(range(len(noun_phrases_lsa[1]))),
                                                noun_phrases_lsa[1], "Latent component",
                                                "Relative importance of each component")
singular_values_noun_phrases_plot





In [19]:
singular_values_noun_plot = plotly_bar_chart_graphic("Singular values", list(range(len(noun_lsa[1]))),
                                                noun_lsa[1], "Latent component", "Relative importance of each component")

singular_values_noun_plot





In [20]:
noun_phrases_term_topic_matrix = pd.DataFrame(data=noun_phrases_lsa[2], index=noun_phrases_vectors.columns,
                                              columns=[f'Latent_concept_{r}' for r in range(noun_phrases_lsa[2].shape[1])])
noun_phrases_term_topic_matrix





Unnamed: 0,Latent_concept_0,Latent_concept_1,Latent_concept_2,Latent_concept_3,Latent_concept_4,Latent_concept_5,Latent_concept_6,Latent_concept_7,Latent_concept_8,Latent_concept_9,Latent_concept_10,Latent_concept_11,Latent_concept_12,Latent_concept_13
19,0.4923897,-0.1010477,-0.05398702,-0.1735063,1.016837e-14,0.3858969,0.2556504,0.0001005462,0.001721313,-0.0006904004,-0.0001081856,0.006740114,0.0005142945,8.476732e-05
agreement,0.008204981,0.04802025,-0.02269724,0.0002030607,-4.217989e-16,-0.0007075427,-0.000598894,0.2631373,7.996132e-06,0.03958474,0.9500491,0.0336331,-0.1341876,-0.06761446
covid,0.4923897,-0.1010477,-0.05398702,-0.1735063,1.017527e-14,0.3858969,0.2556504,0.0001005462,0.001721313,-0.0006904004,-0.0001081856,0.006740114,0.0005142945,8.476732e-05
employer,0.0007187437,0.004375693,-0.003249745,4.779925e-05,-1.683645e-16,-0.0002539946,-0.0003279844,0.9644138,-3.809827e-06,-0.01452237,-0.2625968,-0.006792089,0.02487752,0.006537126
group,0.2513633,-0.02762242,0.06994063,0.03383462,-8.122948e-15,-0.1902549,-0.19436,0.000223264,-0.006529118,-0.01293202,-0.00372932,0.9043882,0.1946383,-0.004296261
measure,-1.466926e-17,1.905981e-16,-1.804371e-16,4.241789e-16,1.0,5.935478e-15,-4.845377e-14,2.223446e-16,2.332639e-16,1.6799870000000002e-17,5.762426e-16,-5.235713000000001e-17,-1.344878e-16,-1.444689e-16
partners,0.08025754,0.4713412,-0.2328232,0.002197204,-5.783557e-16,-0.0079652,-0.00687239,-0.02430171,-0.0003176644,-0.3461055,-0.1295514,0.125132,-0.5854021,-0.4706985
representatives,0.01698057,0.09834091,-0.04001974,0.0002780077,-1.840407e-16,-0.0006914363,-0.0003367996,-0.003948241,0.0002528327,0.9117134,-0.1011121,0.09325484,-0.3701076,-0.03607855
risk,0.6376759,-0.1242786,-0.02316718,0.1599615,-1.628271e-14,-0.5084377,-0.3964431,-0.0003203205,-0.003080328,0.007115194,0.001775775,-0.3613745,-0.07394471,0.001071398
security,0.04511632,0.2363026,0.006417759,-0.0007887824,1.145749e-16,0.00413874,0.004234286,-0.00420847,0.0002592846,0.2093534,0.02045972,-0.1355164,0.6096396,-0.7125717


In [21]:
noun_term_topic_matrix = pd.DataFrame(data=noun_lsa[2], index=noun_vectors.columns,
                                      columns=[f'Latent_concept_{r}' for r in range(noun_lsa[2].shape[1])])
noun_term_topic_matrix





Unnamed: 0,Latent_concept_0,Latent_concept_1,Latent_concept_2
19,0.0,0.7071068,-1.003637e-22
covid,0.0,0.7071068,-1.003637e-22
risk,1.0,0.0,-0.0
wage,0.0,1.419357e-22,1.0


In [22]:
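class TermTopicMatrix:
    """Plots the highest- and lowest-weighted terms of each latent concept."""

    def __init__(self, matrix: pd.DataFrame, num_concepts: int):
        self.matrix = matrix
        self.num_concepts = num_concepts

    def top_term_topic(self) -> List:
        # Collect the figures so the caller can display them (assumes plotly_bar_chart_graphic
        # returns the figure, as in the singular-value cells above)
        figures = []
        for number in range(self.num_concepts):
            # Sort the terms of this concept from highest to lowest weight and keep the first ten
            data = self.matrix[f'Latent_concept_{number}'].sort_values(ascending=False)
            top_10 = data[:10]

            figures.append(plotly_bar_chart_graphic(f"Top term along the axis of latent concept {number}",
                                                    top_10.index, top_10.values, "Noun phrases", "LSA Term Topic"))
        return figures

    def bottom_term_topic(self) -> List:
        figures = []
        for number in range(self.num_concepts):
            data = self.matrix[f'Latent_concept_{number}'].sort_values(ascending=False)
            # The ten lowest-weighted terms are the last ten entries; data[10:] would instead
            # return everything after the top ten
            bottom_10 = data[-10:]

            figures.append(plotly_bar_chart_graphic(f"Bottom term along the axis of latent concept {number}",
                                                    bottom_10.index, bottom_10.values, "Noun phrases", "LSA Term Topic"))
        return figures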





In [25]:
noun_phrases_topic_matrix = TermTopicMatrix(noun_phrases_term_topic_matrix, 14)
noun_topic_matrix = TermTopicMatrix(noun_term_topic_matrix, 3)





### Top topics from noun phrases

In [26]:
noun_phrases_top_topic = noun_phrases_topic_matrix.top_term_topic()





### Bottom topics from noun phrases

In [27]:
noun_phrases_bottom_topic = noun_phrases_topic_matrix.bottom_term_topic()





### Top topics from nouns

In [28]:
noun_top_topic = noun_topic_matrix.top_term_topic()





### Bottom topics from nouns

In [29]:
noun_bottom_topic = noun_topic_matrix.bottom_term_topic()



