In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import math
# !pip install transformers
from transformers import pipeline
pd.set_option('display.max_colwidth', None)

In [None]:
df = pd.read_csv("C:/Users/lasko/Documents/Bocconi/2nd Semester/Natural Language Processing/Final Project/company_aspect_matrix_with_counts.csv")

Company aspect

In [None]:
df.head()

##### Short EDA for company aspects

In [None]:
df['n_mentions'].min(), df['n_mentions'].max()

In [None]:
bins = list(range(0, 110, 5))

plt.figure(figsize=(8,5))
# Default alignment keeps each bar between its bin edges, so bars line up with the x-ticks
plt.hist(df['n_mentions'], bins=bins, edgecolor='black')
plt.title('Distribution of Number of Mentions')
plt.xlabel('Number of Mentions')
plt.ylabel('Frequency')
plt.xticks(bins)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()


Most aspects are mentioned no more than 5 times.

What if we were to drop the aspects for which the number of mentions is lower than 3?
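The reading from the histogram can also be checked numerically. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration; only the column name `n_mentions` comes from the notebook). It computes the share of rows with at most 5 mentions and the share that a `n_mentions >= 3` filter would remove.

```python
import pandas as pd

# Hypothetical toy data standing in for df; only the column name matches the notebook
toy = pd.DataFrame({"n_mentions": [1, 2, 2, 3, 5, 8, 13, 40]})

# Share of rows (firm-aspect pairs) mentioned at most 5 times
share_up_to_5 = (toy["n_mentions"] <= 5).mean()

# Share of rows that a `n_mentions >= 3` filter would drop
share_below_3 = (toy["n_mentions"] < 3).mean()

print(f"<= 5 mentions: {share_up_to_5:.1%}, < 3 mentions: {share_below_3:.1%}")
```

Running the same two expressions on `df` would give the actual shares for the dataset.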

In [None]:
# Count number of aspects per firm
aspect_count = df.groupby('firm')['aspect'].count().reset_index(name='n_aspects')
aspect_count.sort_values('n_aspects', ascending=False)

In [None]:
print(aspect_count[aspect_count['n_aspects'] < 5])
len(aspect_count[aspect_count['n_aspects'] < 5])

In [None]:
# Drop rows where n_mentions < 3
filtered_df = df[df['n_mentions'] >= 3]

# Count number of aspects per firm after filtering
aspect_count_filtered = filtered_df.groupby('firm')['aspect'].count().reset_index(name='n_aspects_filtered')
sorted_aspect_count_filtered = aspect_count_filtered.sort_values('n_aspects_filtered', ascending=False)
sorted_aspect_count_filtered[sorted_aspect_count_filtered['n_aspects_filtered'] < 3]


In [None]:
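comparison = aspect_count.merge(aspect_count_filtered, on='firm', how='left')

# Firms that lose every aspect are missing from the filtered counts; treat them as 0, not NaN
comparison['n_aspects_filtered'] = comparison['n_aspects_filtered'].fillna(0).astype(int)
comparison['n_dropped'] = comparison['n_aspects'] - comparison['n_aspects_filtered']

comparison['n_dropped'].min(), comparison['n_dropped'].max()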


**INSIGHTS**:
- Filtering on n_mentions would remove too many aspects, so we **do not filter on n_mentions**.
- To keep the aspect-based fit reliable, we **filter out the 17 companies that have fewer than 5 aspects** mentioned in the reviews.
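The company-level filter described in the second insight could be applied along these lines. This is a sketch under assumptions: `toy`, `MIN_ASPECTS` and `toy_kept` are hypothetical names, and the toy data only mimics the long format of `df` (one row per firm-aspect pair).

```python
import pandas as pd

# Hypothetical toy data in the same long format as df (one row per firm-aspect pair)
toy = pd.DataFrame({
    "firm": ["A"] * 6 + ["B"] * 2 + ["C"] * 5,
    "aspect": list("abcdef") + list("ab") + list("abcde"),
})

MIN_ASPECTS = 5  # threshold stated in the insights above

# Number of aspects per firm, broadcast back onto every row of that firm
n_aspects = toy.groupby("firm")["aspect"].transform("count")

# Keep only rows belonging to firms with at least MIN_ASPECTS aspects
toy_kept = toy[n_aspects >= MIN_ASPECTS]

print(sorted(toy_kept["firm"].unique()))
```

Using `transform` keeps the result aligned with the original rows, so the boolean mask can be applied directly without a merge.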

Company aspect - pivoted

In [None]:
df_pivot = df.pivot(
    index='firm',
    columns='aspect',
    values='avg_star_rating'
)

In [None]:
df_pivot.head()

## Summarization - pros & cons

In [None]:
df_reviews = pd.read_csv("C:/Users/lasko/Documents/Bocconi/2nd Semester/Natural Language Processing/Final Project/comparison_cleans.csv")
df_reviews.head()

In [None]:
pros = df_reviews['pros_clean_min']
cons = df_reviews['cons_clean_min']

Grouping by company:

In [None]:
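# Drop missing reviews and cast to str so that ' '.join does not fail on NaN values
company_pros = df_reviews.groupby('firm')['pros_clean_min'].apply(lambda texts: ' '.join(texts.dropna().astype(str))).reset_index()
company_cons = df_reviews.groupby('firm')['cons_clean_min'].apply(lambda texts: ' '.join(texts.dropna().astype(str))).reset_index()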

company_pros.head()

In [None]:
len(company_pros)

##### LLM for summarization

In [27]:
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="google/pegasus-cnn_dailymail",
    device=-1
)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Device set to use cpu


Example to summarize - Accenture:

In [28]:
company_pros['pros_summary'] = None

company_pros.at[2, 'pros_summary'] = summarizer(
    company_pros.at[2, 'pros_clean_min'],
    max_length=60, 
    min_length=20, 
    do_sample=False
)[0]['summary_text']


In [29]:
company_pros.head()

Unnamed: 0,firm,pros_clean_min,pros_summary
0,ASOS,"Informal environment and a lot of holiday Perks and good environment great benefits, diverse and international team You get to go home. Eventually.",
1,AXA UK,"Forward thinking, people centred organisation Family environment, friendly company, interesting work Room to progress Lots of initiatives",
2,Accenture,"It's very people-centric and there are plenty of career advancement and personal development opportunities Good medical insurance benefits, good variety of clients and service offerings Fantastic people challenging environment Sets you up with the basics for the rest of your career People are incredibly collaborative and willing to help you, whatever the problem. Always very intelligent people so its easy to delegate work and expect a good job to be done. Great benefits and I have been able to take flexible working options now I have kids, without compromising my career in Consulting. - It's useful if you don't know what else you want to do as you'll get a good idea of different roles and industries. - If you are willing to play the game, you can have a career here. You get to work with driven and talented people The caliber of people is very high. Good for socials and an inclusive work environment with lots of arranges activities and drinks. Variety of sectors / areas you can get involved in dependent upon interest / skills Easy to get in as a graduate, big clients Opportunities, progression, finance, promotion, big customers Its culture and values are great Its policies make you feel that they really care for you Great company with lots of opportunities and variety Very flexible to manage your time and effort Good name in the market - Incredibly bright and enthusiastic people to work with - Industry-leading graduate programme - Sense of belonging and overall community feel (predominantly from Consultant level and above) - Multinational FTSE 500 clients on their biggest business and technology challenges - Continuous improvement on existing methodologies (focus on Design Thinking and empathy-based decision making) and overall focus on innovation - Thorough internal training portal and budget for external training - Restructured performance system, aimed at fairer representation good pay compared to other tech services companies",It's very people-centric and there are plenty of career advancement and personal development opportunities .<n>The caliber of people is very high. Good for socials and an inclusive work environment .<n>Its culture and values are great. Its policies make you feel that they really care for
3,Accor,"Friendly, Fast pace, Efficient, Challenging, Interesting there where no positives in working there There was literally nothing good. Work like a dog but dont expect a payrise or a promotion even if you have been working there for a few years now.",
4,Adecco,"Clear progression routes, great collaboration, brilliant corporate framework and support",


In [None]:
len(company_pros)

##### Google - **google/flan-t5-large**

In [3]:
# -y skips the confirmation prompt, which otherwise blocks a notebook cell
!pip uninstall -y numpy


In [7]:
!pip install numpy==2.0

Collecting numpy==2.0
  Downloading numpy-2.0.0-cp312-cp312-win_amd64.whl.metadata (60 kB)
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
contourpy 1.2.0 requires numpy<2.0,>=1.20, but you have numpy 2.0.0 which is incompatible.
numba 0.59.1 requires numpy<1.27,>=1.22, but you have numpy 2.0.0 which is incompatible.
pywavelets 1.5.0 requires numpy<2.0,>=1.22.4, but you have numpy 2.0.0 which is incompatible.
streamlit 1.32.0 requires numpy<2,>=1.19.3, but you have numpy 2.0.0 which is incompatible.
streamlit 1.32.0 requires packaging<24,>=16.8, but you have packaging 24.1 which is incompatible.
streamlit 1.32.0 requires protobuf<5,>=3.20, but you have protobuf 5.29.4 which is incompatible.
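The conflict messages above list several installed packages that require `numpy<2`, so pinning NumPy below 2 (for example with `pip install "numpy<2"`) is the option they point to. A small check like the following could be run before importing `transformers`. It is a sketch only: `installed_major` is a hypothetical helper, not part of any library, and it assumes a plain `major.minor.patch` version string.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_major(package: str):
    """Return the installed major version of `package`, or None if it is not installed."""
    try:
        return int(version(package).split(".")[0])
    except PackageNotFoundError:
        return None


numpy_major = installed_major("numpy")
if numpy_major is None:
    print("numpy is not installed")
elif numpy_major >= 2:
    # The conflict messages above list packages that require numpy<2
    print("numpy >= 2 detected; packages built against NumPy 1.x may fail to import")
else:
    print("numpy 1.x detected")
```

After changing the installed NumPy version, the kernel has to be restarted before the new version is picked up.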


In [8]:
from transformers import pipeline
pipe = pipeline("text2text-generation", model="google/flan-t5-large")


ImportError: 
A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.2.5 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.



RuntimeError: Failed to import transformers.pipelines because of the following error (look up to see its traceback):
Failed to import transformers.generation.utils because of the following error (look up to see its traceback):
numpy.core.multiarray failed to import

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
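Unlike a dedicated summarization checkpoint, an instruction-tuned model such as FLAN-T5 is driven by a text prompt. The sketch below shows one way such a prompt could be built; `build_summary_prompt` and its wording are hypothetical and are not taken from the notebook. The commented lines indicate how the `tokenizer` and `model` loaded above might be used with it.

```python
def build_summary_prompt(text: str, kind: str = "pros") -> str:
    """Build an instruction prompt for an instruction-tuned model (hypothetical wording)."""
    if kind not in {"pros", "cons"}:
        raise ValueError("kind must be 'pros' or 'cons'")
    label = "positive aspects" if kind == "pros" else "negative aspects"
    return (
        f"Summarize the main {label} employees mention about this company "
        f"in one or two sentences.\n\nReviews: {text.strip()}"
    )


prompt = build_summary_prompt("Friendly team, good benefits, flexible hours", kind="pros")
print(prompt)

# With the tokenizer and model loaded in the cell above, generation could look like:
# inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
# output_ids = model.generate(**inputs, max_new_tokens=60)
# summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

The `max_length=512` and `max_new_tokens=60` values are assumptions chosen for illustration and would need tuning against the actual review lengths.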

##### Summarization of **pros** per company

In [None]:
# truncation=True clips inputs that exceed the model's maximum input length
company_pros['pros_summary'] = company_pros['pros_clean_min'].apply(
    lambda text: summarizer(text, max_length=60, min_length=20, do_sample=False, truncation=True)[0]['summary_text']
)

##### Summarization of **cons** per company

In [None]:
# truncation=True clips inputs that exceed the model's maximum input length
company_cons['cons_summary'] = company_cons['cons_clean_min'].apply(
    lambda text: summarizer(text, max_length=60, min_length=20, do_sample=False, truncation=True)[0]['summary_text']
)
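Concatenating all reviews for a firm can produce inputs far longer than a summarization model accepts in one pass. One possible workaround is to split the text into chunks, summarize each chunk and join the partial summaries. The sketch below illustrates that idea with hypothetical helpers (`chunk_words`, `summarize_long`) and a stand-in summarizer, so it runs without downloading a model. Word counts are used only as a rough proxy for token counts.

```python
from typing import Callable, List


def chunk_words(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_long(text: str, summarize_fn: Callable[[str], str], max_words: int = 400) -> str:
    """Summarize each chunk, then join the partial summaries (a simple map-then-join scheme)."""
    chunks = chunk_words(text, max_words)
    return " ".join(summarize_fn(chunk) for chunk in chunks)


# Stand-in summarizer for illustration only: keeps the first three words of each chunk
def fake_summarizer(chunk: str) -> str:
    return " ".join(chunk.split()[:3])


text = "one two three four five six seven eight nine ten"
print(chunk_words(text, 4))
print(summarize_long(text, fake_summarizer, max_words=4))
```

With the notebook's pipeline, `summarize_fn` could be a lambda wrapping `summarizer(...)[0]['summary_text']` with the same `max_length`, `min_length` and `do_sample` arguments used above. The default of 400 words per chunk is an assumption, not a measured limit.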

## Ratings - preparation

## Output formatting