<a href="https://colab.research.google.com/github/hiwan/Text_Analysis_Final_Project/blob/main/project_text_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What determines corporate risk attitudes?

# 1. Introduction

# 1-1.  Research Question

- The research question of this project is *'What determines corporate risk attitudes?'*

- Given that American companies tend to be risk-seeking while Japanese companies are risk-averse, this project aims to analyze news articles mentioning these firms to identify what factors influence corporate risk attitudes. Specifically, this project will determine what makes a company risk-seeking or risk-averse.

- In academic studies, the standard deviation of return on assets (Std of ROA) is used as a proxy for corporate risk attitude: a higher Std of ROA indicates a more risk-seeking firm, and a lower Std of ROA a more risk-averse one. American companies, with a Std of ROA of 0.088, are risk-seeking and rank second among G7 countries, while Japanese firms, with a Std of ROA of 0.022, are risk-averse and rank at the bottom among G7 countries [(Noma, 2021)](https://www.hit-u.ac.jp/hq-mag/research_issues/430_20210701/). This study indicates that American firms are more risk-seeking than Japanese ones.

- As for the cause of these risk attitudes, one study argues that differences in creditor protection systems, under which bankruptcy is easier in the U.S. and more difficult in Japan, explain the differing risk attitudes of American and Japanese companies [(Acharya, Amihud and Litov, 2011)](https://www-sciencedirect-com.ezproxy.cul.columbia.edu/science/article/pii/S0304405X11001012).

- However, is this the only cause of the risk attitudes? This project presents a new hypothesis, outlined in Section 1-2, and tests it by analyzing news articles that mention American and Japanese companies.
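
As a minimal sketch of the risk measure described above, the standard deviation of ROA can be computed from a firm's yearly ROA values. The firm names and ROA figures below are made up for illustration and are not taken from Orbis or from Noma (2021). The sample standard deviation is an assumption, since the cited source's exact estimator is not stated here.

```python
from statistics import stdev

# Hypothetical yearly ROA values (net income / total assets); not real data.
roa_by_firm = {
    "Firm A": [0.02, 0.15, -0.05, 0.20, 0.08],
    "Firm B": [0.040, 0.045, 0.038, 0.042, 0.041],
}

# Sample standard deviation of ROA per firm (an assumed choice of estimator).
std_of_roa = {firm: stdev(values) for firm, values in roa_by_firm.items()}

# A larger Std of ROA is read as a more risk-seeking attitude.
more_risk_seeking = max(std_of_roa, key=std_of_roa.get)
print(std_of_roa, more_risk_seeking)
```

In this toy example, Firm A's earnings swing widely from year to year, so it would be classified as the more risk-seeking of the two.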

# 1-2. Hypothesis

- The hypothesis of this project is that companies become risk-seeking when they prioritize shareholders and risk-averse when they prioritize employees.

- The laws of Delaware, where two-thirds of the S&P 500 companies are incorporated [(Simmerman, Chandler III, Berger, Goodrich, and Rosati, 2024)](https://corpgov.law.harvard.edu/2024/05/08/delawares-status-as-the-favored-corporate-home-reflections-and-considerations/), adopt the idea that firms should prioritize the interests of shareholders over those of other stakeholders, such as employees [(Hinricks, 2019)](https://blogs.cuit.columbia.edu/millsteincenter/2019/06/26/does-and-should-delaware-law-allow-long-term-stakeholder-governance/#:~:text=Although%20the%20law%20allows%20corporate,above%20the%20interests%20of%20shareholders.). Under these laws, managers may be able to take risks and generate returns for shareholders without worrying about maintaining employment.

- By contrast, in Japan, while the number of companies abandoning the system of lifetime employment has been increasing recently, the majority of companies have traditionally adopted the system of lifetime employment [(Ono, 2010)](https://www-sciencedirect-com.ezproxy.cul.columbia.edu/science/article/pii/S0889158309000598). Since companies that employ a system of lifetime employment value their employees, they may try to avoid taking risks and going bankrupt.

- Thus, while American companies may take risks in pursuit of growth to maximize shareholder profits, Japanese companies may not get involved in risky businesses to maintain employment.

# 1-3. Importance of the Topic

- Making risk-averse Japanese companies more risk-seeking is important for reviving the Japanese economy, which is struggling with prolonged stagnation.

- One of the reasons for Japan's 30 years of economic stagnation is that its companies are risk-averse, because, in general, the more risk a company takes, the higher its expected rate of return. In contrast, American companies have taken risks and created numerous innovations, such as cloud computing and artificial intelligence, contributing to the growth of the U.S. economy.

- The Japanese government’s initiatives to encourage risk-taking, such as creating a national investment fund for innovation, have failed to make Japanese companies more risk-seeking. Identifying the drivers of risk-taking by comparing American and Japanese companies’ attitudes will show the root cause of the risk-averse attitudes in Japan.

- Therefore, this research is relevant and essential to updating Japanese public policies aimed at making firms more risk-seeking.

# 2. Method

## Step 1. Selection of Target Companies
- Identify ten risk-seeking American firms with a large Std of ROA from the S&P 500, which includes major American firms.
- Identify ten risk-averse Japanese companies with a small Std of ROA from the Nikkei 225, which includes major Japanese firms.
- Use the [Orbis database](https://www.moodys.com/web/en/us/capabilities/company-reference-data/orbis.html?cid=ppc-gglds-16891&gad_source=1&gclid=CjwKCAiA9vS6BhA9EiwAJpnXw0bDOxaaJ82oy1HZh_nHo36OWxBGinwqbt1sOY0_617gbYPp2EH9oBoC6ogQAvD_BwE&gclsrc=aw.ds), which contains financial information on listed companies worldwide.

## Step 2. Data Collection
- Aggregate articles related to risk from [Factiva](https://global-factiva-com.ezproxy.cul.columbia.edu/sb/default.aspx?lnep=hp) by searching for risk-related words, such as "risk".
- Aggregate these companies’ articles from [Factiva](https://global-factiva-com.ezproxy.cul.columbia.edu/sb/default.aspx?lnep=hp) by searching for company names. Cover the period from 2013 to 2018, which avoids major external shocks that might have affected corporate attitudes, such as the Great East Japan Earthquake in 2011, the subsequent economic downturn in 2012, and the global pandemic that began in 2019.

## Step 3. Data Cleaning
- Clean the text data of the articles by removing capitalization, punctuation, and stop words, as well as tokenizing and lemmatizing words.

## Step 4. Data Analysis to Identify Risk-related Words
- Pick up risk-related words by topic modeling the articles related to risk selected in Step 2.
- Modify the list of risk-related words by excluding words that are not related to risk and adding some words needed to test the hypothesis.

## Step 5. Data Analysis to Identify Factors that Affect Risk Attitudes
- Identify which topics are most popular among risk-seeking companies and which are most popular among risk-averse companies by TF-IDF.

## Step 6. Data Visualization
- Create heat maps of the risk-related words, ranked by their TF-IDF scores, for both American and Japanese companies to see which topics American and Japanese firms prioritize.

## Step 7. Conclusion and Policy Implication
- Identify what factors encourage companies to be more risk-seeking and what factors encourage them to be more risk-averse.
- Propose public policies to make Japanese companies more risk-seeking based on the identified topics that affect corporate risk attitudes.


# 3. Analysis

## Step 1. Selection of Target Companies

- Identify risk-seeking American companies by selecting companies in the top 25% of Std of ROA among the top 500 U.S. companies by sales, using data from [Orbis](https://www.moodys.com/web/en/us/capabilities/company-reference-data/orbis.html?cid=ppc-gglds-16891&gad_source=1&gclid=CjwKCAiA9vS6BhA9EiwAJpnXw0bDOxaaJ82oy1HZh_nHo36OWxBGinwqbt1sOY0_617gbYPp2EH9oBoC6ogQAvD_BwE&gclsrc=aw.ds).
- From each sector, select the risk-seeking company with the largest sales (the company with the smallest `No`) as the representative.

In [1]:
import pandas as pd

In [2]:
# Identify American companies in the top 25% of Std of ROA among the top 500 U.S. companies by sales.
top_500_us = pd.read_excel('/content/top_500_us.xlsx')
q3 = top_500_us['Std of ROA'].quantile(0.75)
risk_seeking_firms = top_500_us[top_500_us['Std of ROA'] >= q3]
risk_seeking_firms.head()

Unnamed: 0,No,Name,Sector,Std of ROA
1,2.0,"AMAZON.COM, INC.",Retail,0.215866
2,3.0,APPLE INC.,Computer Hardware,0.11532
4,5.0,EXXON MOBIL CORP,"Chemicals, Petroleum, Rubber & Plastic",0.053457
6,7.0,ALPHABET INC.,Media & Broadcasting,0.053254
12,13.0,MICROSOFT CORPORATION,"Industrial, Electric & Electronic Machinery",0.052224


In [10]:
# Select one representative company per sector: the one with the largest sales (the smallest 'No').
# .copy() makes the result an independent DataFrame so the column assignment below is safe.
risk_seeking_firms_bysector = risk_seeking_firms.loc[risk_seeking_firms.groupby('Sector')['No'].idxmin()].copy()

# Add column of country.
risk_seeking_firms_bysector['Country'] = 'U.S.'

risk_seeking_firms_bysector.head()

Unnamed: 0,No,Name,Sector,Std of ROA,Country
138,139.0,HEALTH CARE SERVICE CORPORATION GROUP(US MARKE...,"Banking, Insurance & Financial Services",0.045619,U.S.
439,440.0,BOOKING HOLDINGS INC.,Business Services,0.729338,U.S.
4,5.0,EXXON MOBIL CORP,"Chemicals, Petroleum, Rubber & Plastic",0.053457,U.S.
38,39.0,AT&T INC.,Communications,0.046481,U.S.
2,3.0,APPLE INC.,Computer Hardware,0.11532,U.S.
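
The selection logic used above (a quartile threshold followed by a per-sector `idxmin`) can be sketched on a toy table. The firms, sectors, and values here are hypothetical and chosen only so that two firms in one sector pass the threshold.

```python
import pandas as pd

# Hypothetical firms; 'No' is the sales rank (1 = largest sales).
toy = pd.DataFrame({
    "No": list(range(1, 13)),
    "Name": list("ABCDEFGHIJKL"),
    "Sector": ["Retail", "Retail", "Energy", "Energy", "Retail", "Energy",
               "Retail", "Energy", "Retail", "Energy", "Retail", "Energy"],
    "Std of ROA": [0.01, 0.30, 0.02, 0.40, 0.03, 0.04,
                   0.05, 0.35, 0.06, 0.07, 0.08, 0.09],
})

# Keep firms at or above the 75th percentile of Std of ROA.
q3 = toy["Std of ROA"].quantile(0.75)
risky = toy[toy["Std of ROA"] >= q3]

# Within each sector, idxmin() on 'No' returns the row label of the firm
# with the largest sales; .loc then pulls those rows.
representatives = risky.loc[risky.groupby("Sector")["No"].idxmin()].copy()
representatives["Country"] = "U.S."
print(representatives)
```

Firms B, D, and H pass the threshold. D and H are both in Energy, and D is kept because it has the smaller `No`.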


- Identify risk-averse Japanese companies by selecting companies in the bottom 25% of Std of ROA among the top 500 Japanese companies by sales, using data from [Orbis](https://www.moodys.com/web/en/us/capabilities/company-reference-data/orbis.html?cid=ppc-gglds-16891&gad_source=1&gclid=CjwKCAiA9vS6BhA9EiwAJpnXw0bDOxaaJ82oy1HZh_nHo36OWxBGinwqbt1sOY0_617gbYPp2EH9oBoC6ogQAvD_BwE&gclsrc=aw.ds).
- From each sector, select the risk-averse company with the largest sales (the company with the smallest `No`) as the representative.

In [4]:
# Identify Japanese companies in the bottom 25% of Std of ROA among the top 500 Japanese companies by sales.
top_500_jp = pd.read_excel('/content/top_500_jp.xlsx')
q1 = top_500_jp['Std of ROA'].quantile(0.25)
risk_averse_firms = top_500_jp[top_500_jp['Std of ROA'] <= q1]
risk_averse_firms.head()

Unnamed: 0,No,Name,Sector,Std of ROA
13,14.0,JAPAN POST HOLDING CO LTD,Business Services,0.000457
14,15.0,SEVEN & I HOLDINGS CO LTD,Retail,0.006528
18,19.0,AEON CO LTD,Retail,0.007504
22,23.0,NIPPON LIFE INSURANCE CO.,"Banking, Insurance & Financial Services",0.001111
24,25.0,DAI-ICHI LIFE HOLDINGS INC.,"Banking, Insurance & Financial Services",0.001796


In [11]:
# Select one representative risk-averse company per sector: the one with the largest sales (the smallest 'No').
# .copy() makes the result an independent DataFrame so the column assignment below is safe.
risk_averse_firms_bysector = risk_averse_firms.loc[risk_averse_firms.groupby('Sector')['No'].idxmin()].copy()

# Add column of country.
risk_averse_firms_bysector['Country'] = 'Japan'

risk_averse_firms_bysector.head()

Unnamed: 0,No,Name,Sector,Std of ROA,Country
22,23.0,NIPPON LIFE INSURANCE CO.,"Banking, Insurance & Financial Services",0.001111,Japan
13,14.0,JAPAN POST HOLDING CO LTD,Business Services,0.000457,Japan
291,292.0,AIR WATER INC,"Chemicals, Petroleum, Rubber & Plastic",0.005723,Japan
383,384.0,INFRONEER HOLDINGS INC.,Construction,0.004551,Japan
181,182.0,SUNTORY BEVERAGE & FOOD LTD,Food & Tobacco Manufacturing,0.009931,Japan


- Combine these two datasets to compare companies in the same sector in Japan and the United States.

In [12]:
# Combine the two datasets vertically and keep only sectors that appear more than once.
df_combined = pd.concat([risk_seeking_firms_bysector, risk_averse_firms_bysector], axis=0, ignore_index=True)
sector_counts = df_combined['Sector'].value_counts()
df_multiple_sectors = df_combined[df_combined['Sector'].isin(sector_counts[sector_counts > 1].index)]
df_multiple_sectors.sort_values(by='Sector', ascending=True)

Unnamed: 0,No,Name,Sector,Std of ROA,Country
0,139.0,HEALTH CARE SERVICE CORPORATION GROUP(US MARKE...,"Banking, Insurance & Financial Services",0.045619,U.S.
20,23.0,NIPPON LIFE INSURANCE CO.,"Banking, Insurance & Financial Services",0.001111,Japan
1,440.0,BOOKING HOLDINGS INC.,Business Services,0.729338,U.S.
21,14.0,JAPAN POST HOLDING CO LTD,Business Services,0.000457,Japan
2,5.0,EXXON MOBIL CORP,"Chemicals, Petroleum, Rubber & Plastic",0.053457,U.S.
22,292.0,AIR WATER INC,"Chemicals, Petroleum, Rubber & Plastic",0.005723,Japan
6,243.0,"D.R. HORTON, INC.",Construction,0.093332,U.S.
23,384.0,INFRONEER HOLDINGS INC.,Construction,0.004551,Japan
7,196.0,COCA-COLA COMPANY (THE),Food & Tobacco Manufacturing,0.048408,U.S.
24,182.0,SUNTORY BEVERAGE & FOOD LTD,Food & Tobacco Manufacturing,0.009931,Japan
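
One possible way to read the combined table sector by sector is to pivot it so that each country becomes a column. This is only a sketch on a hypothetical stand-in for `df_combined`; the firm names and values are illustrative, and the "gap" column is an added convenience, not part of the original analysis.

```python
import pandas as pd

# Hypothetical stand-in for df_combined (values are illustrative only).
toy_combined = pd.DataFrame({
    "Name": ["US Firm X", "JP Firm X", "US Firm Y", "JP Firm Z"],
    "Sector": ["Retail", "Retail", "Energy", "Mining"],
    "Std of ROA": [0.20, 0.01, 0.10, 0.02],
    "Country": ["U.S.", "Japan", "U.S.", "Japan"],
})

# One row per sector, one column per country; a sector missing a country gets NaN.
side_by_side = toy_combined.pivot(index="Sector", columns="Country", values="Std of ROA")

# Keep only sectors represented in both countries, then compute the gap.
paired = side_by_side.dropna().copy()
paired["Gap (U.S. - Japan)"] = paired["U.S."] - paired["Japan"]
print(paired)
```

Dropping rows with missing values plays the same role as the `value_counts() > 1` filter above: only sectors with a firm from each country remain.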


- This result shows the risk-seeking American company and the risk-averse Japanese company in each sector.

- This project searches [Factiva](https://global-factiva-com.ezproxy.cul.columbia.edu/sb/default.aspx?lnep=hp) for news articles from 2013 to 2018 for all these companies. However, it was unable to find more than 10 articles for the following firms from four sectors.

| Name | Sector |
| ----------- | ----------- |
| HEALTH CARE SERVICE CORPORATION GROUP | Banking, Insurance & Financial Services |
| INFRONEER HOLDINGS INC | Construction |
| NATIONAL FEDERATION OF AGRICULTURAL CO-OPERATIVE ASSOCIATIONS | Public Administration, Education, Health Social Services |
| LOGISTEED, LTD. | Transport, Freight & Storage |

- Therefore, to ensure a sufficient number of articles for analysis, this project excludes the above sectors and selects 20 companies, one American and one Japanese from each of 10 sectors, for the research. This project analyzes about 100 articles from these 20 companies.

In [117]:
# Pick up 20 companies from 10 sectors, excluding sectors where sufficient data cannot be collected.
selected_companies = df_multiple_sectors[
    (df_multiple_sectors['Sector'] != 'Banking, Insurance & Financial Services') &
    (df_multiple_sectors['Sector'] != 'Construction') &
    (df_multiple_sectors['Sector'] != 'Public Administration, Education, Health Social Services') &
    (df_multiple_sectors['Sector'] != 'Transport, Freight & Storage')
].copy()

selected_companies.sort_values(by='Sector', ascending=True)

Unnamed: 0,No,Name,Sector,Std of ROA,Country
1,440.0,BOOKING HOLDINGS INC.,Business Services,0.729338,U.S.
21,14.0,JAPAN POST HOLDING CO LTD,Business Services,0.000457,Japan
2,5.0,EXXON MOBIL CORP,"Chemicals, Petroleum, Rubber & Plastic",0.053457,U.S.
22,292.0,AIR WATER INC,"Chemicals, Petroleum, Rubber & Plastic",0.005723,Japan
7,196.0,COCA-COLA COMPANY (THE),Food & Tobacco Manufacturing,0.048408,U.S.
24,182.0,SUNTORY BEVERAGE & FOOD LTD,Food & Tobacco Manufacturing,0.009931,Japan
8,13.0,MICROSOFT CORPORATION,"Industrial, Electric & Electronic Machinery",0.052224,U.S.
25,415.0,"NISSIN FOODS HOLDINGS CO., LTD.","Industrial, Electric & Electronic Machinery",0.009611,Japan
12,293.0,"CBRE GROUP, INC.",Property Services,0.059689,U.S.
27,213.0,MITSUBISHI ESTATE CO LTD,Property Services,0.010026,Japan


## Step 2. Data Collection

- Download the risk-related articles and the articles about the selected companies from [Factiva](https://global-factiva-com.ezproxy.cul.columbia.edu/sb/default.aspx?lnep=hp).

- For the risk-related articles, this project picks up about 100 articles from 2013 to 2018 using the keyword "Risk Attitude."

- For articles about the selected firms, this project chooses about 100 articles from 2013 to 2018 using each company's name as the keyword.

- Store the paths to those RTF files in variables.
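
As an alternative to one variable per file, the paths could be gathered into dictionaries keyed by company. The sketch below assumes the `us_<company>.rtf` / `jp_<company>.rtf` naming used in this notebook. `collect_rtf_paths` is a hypothetical helper, not part of the original code.

```python
from pathlib import Path

def collect_rtf_paths(data_dir, prefix):
    """Map a short company key to its RTF path, e.g. 'amazon' -> '<data_dir>/us_amazon.rtf'."""
    paths = {}
    for path in sorted(Path(data_dir).glob(f"{prefix}_*.rtf")):
        key = path.stem[len(prefix) + 1:]  # drop the 'us_' or 'jp_' prefix
        paths[key] = str(path)
    return paths

# Example usage, assuming the files were uploaded to /content:
# us_paths = collect_rtf_paths('/content', 'us')
# jp_paths = collect_rtf_paths('/content', 'jp')
```

A dictionary like this makes it possible to tokenize every company in a single loop rather than repeating one line per firm.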

In [14]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [16]:
# Path to the risk-related articles.
file_path_risk = '/content/risk.rtf'

In [17]:
# Paths to the articles about risk-seeking American companies.
file_path_amazon = '/content/us_amazon.rtf'
file_path_avnet =  '/content/us_avnet.rtf'
file_path_booking = '/content/us_booking.rtf'
file_path_coca_cola = '/content/us_coca_cola.rtf'
file_path_exxon = '/content/us_exxon.rtf'
file_path_gbre = '/content/us_gbre.rtf'
file_path_general_motors = '/content/us_general_motors.rtf'
file_path_microsoft = '/content/us_microsoft.rtf'
file_path_nrg = '/content/us_nrg.rtf'
file_path_starbucks = '/content/us_starbucks.rtf'

In [18]:
# Paths to the articles about risk-averse Japanese companies.
file_path_air_water = '/content/jp_air_water.rtf'
file_path_fuyo = '/content/jp_fuyo.rtf'
file_path_japan_post = '/content/jp_japan_post.rtf'
file_path_medipal = '/content/jp_medipal.rtf'
file_path_mitsubishi_estate = '/content/jp_mitsubishi_estate.rtf'
file_path_nissin = '/content/jp_nissin.rtf'
file_path_osaka = '/content/jp_osaka.rtf'
file_path_seven_i = '/content/jp_seven_i.rtf'
file_path_suntory = '/content/jp_suntory.rtf'
file_path_toyota = '/content/jp_toyota.rtf'

## Step 3. Data Cleaning

- Clean the data by converting the text to lowercase, removing digits and punctuation, tokenizing the text, removing stopwords, and lemmatizing the tokens.

- Additionally, this project removes words that are clearly unrelated to risk but appeared as risk-related words during the iterative process of identifying words related to risk.

In [23]:
# Extract text from RTF file.

from striprtf.striprtf import rtf_to_text

def extract_text_from_rtf(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        rtf_content = file.read()
    return rtf_to_text(rtf_content)

In [25]:
import re

# Process text.
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    return text

# Tokenize text.
def tokenize_text(text):
    return word_tokenize(text)

# Eliminate stopwords.
def remove_stopwords(tokens):
    stop_words = set(stopwords.words('english'))
    return [word for word in tokens if word not in stop_words]

# Eliminate specific words.
def remove_specific_words(tokens, words_list):
    return [word for word in tokens if word not in words_list]

# Lemmatize tokens.
def lemmatize_tokens(tokens):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in tokens]

# Create tokens for analysis.
def create_tokens_for_analysis(text, words_list):
    processed_text = preprocess_text(text)
    tokens1 = tokenize_text(processed_text)
    tokens2 = remove_stopwords(tokens1)
    tokens3 = remove_specific_words(tokens2, words_list)
    tokens4 = lemmatize_tokens(tokens3)
    return tokens4

In [43]:
# Pick up unrelated words.
words_list = ['news', 'one', 'people', 'would', 'think', 'attitude']

In [21]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [26]:
# Create tokens of risk-related articles.
tokens_risk = create_tokens_for_analysis(extract_text_from_rtf(file_path_risk), words_list)

In [27]:
# Create tokens of articles about risk-seeking American companies.
us_tokens_amazon = create_tokens_for_analysis(extract_text_from_rtf(file_path_amazon), words_list)
us_tokens_avnet = create_tokens_for_analysis(extract_text_from_rtf(file_path_avnet), words_list)
us_tokens_booking = create_tokens_for_analysis(extract_text_from_rtf(file_path_booking), words_list)
us_tokens_coca_cola = create_tokens_for_analysis(extract_text_from_rtf(file_path_coca_cola), words_list)
us_tokens_exxon = create_tokens_for_analysis(extract_text_from_rtf(file_path_exxon), words_list)
us_tokens_gbre = create_tokens_for_analysis(extract_text_from_rtf(file_path_gbre), words_list)
us_tokens_general_motors = create_tokens_for_analysis(extract_text_from_rtf(file_path_general_motors), words_list)
us_tokens_microsoft = create_tokens_for_analysis(extract_text_from_rtf(file_path_microsoft), words_list)
us_tokens_nrg = create_tokens_for_analysis(extract_text_from_rtf(file_path_nrg), words_list)
us_tokens_starbucks = create_tokens_for_analysis(extract_text_from_rtf(file_path_starbucks), words_list)

In [28]:
# Create tokens of articles about risk-averse Japanese companies.
jp_tokens_air_water = create_tokens_for_analysis(extract_text_from_rtf(file_path_air_water), words_list)
jp_tokens_fuyo = create_tokens_for_analysis(extract_text_from_rtf(file_path_fuyo), words_list)
jp_tokens_japan_post = create_tokens_for_analysis(extract_text_from_rtf(file_path_japan_post), words_list)
jp_tokens_medipal = create_tokens_for_analysis(extract_text_from_rtf(file_path_medipal), words_list)
jp_tokens_mitsubishi_estate = create_tokens_for_analysis(extract_text_from_rtf(file_path_mitsubishi_estate), words_list)
jp_tokens_nissin = create_tokens_for_analysis(extract_text_from_rtf(file_path_nissin), words_list)
jp_tokens_osaka = create_tokens_for_analysis(extract_text_from_rtf(file_path_osaka), words_list)
jp_tokens_seven_i = create_tokens_for_analysis(extract_text_from_rtf(file_path_seven_i), words_list)
jp_tokens_suntory = create_tokens_for_analysis(extract_text_from_rtf(file_path_suntory), words_list)
jp_tokens_toyota = create_tokens_for_analysis(extract_text_from_rtf(file_path_toyota), words_list)
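
One detail of `create_tokens_for_analysis` is worth illustrating: the specific words are removed before lemmatization, so an inflected form can pass the filter and then reappear in its base form. The sketch below uses a deliberately naive stand-in lemmatizer that strips one trailing "s". It is not WordNet, and it is meant only to show the effect of the ordering.

```python
def naive_lemmatize(word):
    # Stand-in for WordNetLemmatizer: strips one trailing 's' (illustration only).
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def filter_then_lemmatize(tokens, words_list):
    # Same order as create_tokens_for_analysis: filter first, lemmatize second.
    kept = [w for w in tokens if w not in words_list]
    return [naive_lemmatize(w) for w in kept]

def lemmatize_then_filter(tokens, words_list):
    # Alternative order: lemmatize first so inflected forms are also filtered.
    lemmas = [naive_lemmatize(w) for w in tokens]
    return [w for w in lemmas if w not in words_list]

tokens = ["attitudes", "attitude", "risks", "market"]
words_list = ["attitude"]

print(filter_then_lemmatize(tokens, words_list))
print(lemmatize_then_filter(tokens, words_list))
```

With the first ordering, "attitudes" survives the filter and becomes "attitude". With the second, both forms are removed. Which behavior is wanted depends on whether the excluded words should also cover their inflections.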

## Step 4. Data Analysis to Identify Risk-related Words

- Extract words related to risk by performing topic modeling analysis on articles about risk.

- Identify the most appropriate topic word list by examining the distribution of each topic.

- Add "shareholder", "employee", "fire", and "secure" to the word list so that it can be used to test the hypothesis.

In [29]:
# Define functions for topic modeling.
from gensim import corpora, models
from gensim.models import LdaModel

def perform_topic_modeling(tokens, num_topics=3, passes=15, num_words=10):
    # The full token list is treated as a single document.
    dictionary = corpora.Dictionary([tokens])
    corpus = [dictionary.doc2bow(tokens)]
    lda_model = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=passes)
    topic_list = lda_model.print_topics(num_topics=num_topics, num_words=num_words)
    # Map 'Topic <id>' to its formatted string of weighted words.
    topics = {f'Topic {idx}': topic for idx, topic in topic_list}
    # log_perplexity() returns gensim's per-word likelihood bound for the corpus.
    log_likelihood = lda_model.log_perplexity(corpus)
    return lda_model, topics, log_likelihood, dictionary, corpus

def get_document_topic_distribution(lda_model, dictionary, document_tokens):
    bow = dictionary.doc2bow(document_tokens)
    topic_distribution = lda_model.get_document_topics(bow, minimum_probability=0)
    return topic_distribution

In [30]:
# Identify topics, log likelihood, and document topic distribution.
lda_model, risk_topics, log_likelihood, dictionary, corpus = perform_topic_modeling(tokens_risk, num_topics=5, passes=15, num_words=10)
document_topic_distribution = get_document_topic_distribution(lda_model, dictionary, tokens_risk)

print("Topics:", risk_topics)
print("Log Likelihood:", log_likelihood)
print("Document Topic Distribution:", document_topic_distribution)

Topics: {'Topic 0': '0.001*"risk" + 0.000*"research" + 0.000*"attitude" + 0.000*"study" + 0.000*"market" + 0.000*"investor" + 0.000*"individual" + 0.000*"know" + 0.000*"decision" + 0.000*"farmer"', 'Topic 1': '0.001*"risk" + 0.000*"attitude" + 0.000*"research" + 0.000*"information" + 0.000*"individual" + 0.000*"decision" + 0.000*"level" + 0.000*"investor" + 0.000*"financial" + 0.000*"money"', 'Topic 2': '0.001*"risk" + 0.000*"attitude" + 0.000*"research" + 0.000*"decision" + 0.000*"individual" + 0.000*"market" + 0.000*"study" + 0.000*"investor" + 0.000*"result" + 0.000*"journal"', 'Topic 3': '0.000*"risk" + 0.000*"attitude" + 0.000*"research" + 0.000*"decision" + 0.000*"investor" + 0.000*"financial" + 0.000*"individual" + 0.000*"farmer" + 0.000*"information" + 0.000*"time"', 'Topic 4': '0.022*"risk" + 0.009*"attitude" + 0.006*"research" + 0.004*"investor" + 0.004*"individual" + 0.004*"decision" + 0.004*"financial" + 0.004*"study" + 0.003*"market" + 0.003*"information"'}
Log Likelihood:

In [31]:
# Display all topic words to check for inappropriate words.
topics_words = {}
for topic, terms in risk_topics.items():
    words = [term.split('*')[1].replace('"', '') for term in terms.split(' + ')]
    topics_words[topic] = words

for topic, words in topics_words.items():
    print(f'{topic}: {words}')

Topic 0: ['risk', 'research', 'attitude', 'study', 'market', 'investor', 'individual', 'know', 'decision', 'farmer']
Topic 1: ['risk', 'attitude', 'research', 'information', 'individual', 'decision', 'level', 'investor', 'financial', 'money']
Topic 2: ['risk', 'attitude', 'research', 'decision', 'individual', 'market', 'study', 'investor', 'result', 'journal']
Topic 3: ['risk', 'attitude', 'research', 'decision', 'investor', 'financial', 'individual', 'farmer', 'information', 'time']
Topic 4: ['risk', 'attitude', 'research', 'investor', 'individual', 'decision', 'financial', 'study', 'market', 'information']


In [110]:
# Add the hypothesis-related words (shareholder, employee, fire, secure) to the word list of the topic with the highest probability.
selected_topic_words = topics_words['Topic 4']
risk_words = selected_topic_words + ['shareholder', 'employee', 'fire', 'secure']
print(risk_words)

['risk', 'attitude', 'research', 'investor', 'individual', 'decision', 'financial', 'study', 'market', 'information', 'shareholder', 'employee', 'fire', 'secure']


## Step 5. Data Analysis to Identify Factors that Affect Risk Attitudes


- This project calculates the TF-IDF values for a set of tokenized documents by using TfidfVectorizer from sklearn, with a predefined vocabulary (risk_words).

- The documents include tokens from both U.S. and Japanese companies, such as us_tokens_amazon and jp_tokens_toyota. These tokenized documents are converted into strings and processed to compute the TF-IDF matrix, which quantifies the importance of each word across the document collection.

- This matrix is transformed into a DataFrame using pandas, where each column represents a word and each row corresponds to a document, displaying the TF-IDF scores for analysis.

- Finally, this project shows the average TF-IDF scores of each risk word by country and their differences.
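
To make the scores easier to interpret, here is a minimal pure-Python sketch of the computation for a fixed vocabulary. It follows what are understood to be scikit-learn's defaults for `TfidfVectorizer`: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2-normalized rows. The two toy documents and `tfidf_rows` are hypothetical.

```python
import math

def tfidf_rows(tokenized_docs, vocabulary):
    """TF-IDF over a fixed vocabulary with smoothed IDF and L2-normalized rows."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each vocabulary term.
    df = {t: sum(1 for doc in tokenized_docs if t in doc) for t in vocabulary}
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for doc in tokenized_docs:
        raw = [doc.count(t) * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm if norm else 0.0 for v in raw])
    return rows

docs = [
    ["shareholder", "risk", "shareholder", "market"],
    ["employee", "market", "secure"],
]
vocab = ["risk", "market", "shareholder", "employee"]
for row in tfidf_rows(docs, vocab):
    print([round(v, 3) for v in row])
```

Because each row is normalized to unit length, a score reflects a word's weight relative to the other vocabulary words in the same document, not its absolute frequency.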

In [123]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TF-IDF vectorizer using a predefined vocabulary of risk-related words
vectorizer = TfidfVectorizer(vocabulary=risk_words)

# List of tokenized documents from both US and Japanese companies
tokenized_documents = [
    us_tokens_amazon,
    us_tokens_avnet,
    us_tokens_booking,
    us_tokens_coca_cola,
    us_tokens_exxon,
    us_tokens_gbre,
    us_tokens_general_motors,
    us_tokens_microsoft,
    us_tokens_nrg,
    us_tokens_starbucks,
    jp_tokens_air_water,
    jp_tokens_fuyo,
    jp_tokens_japan_post,
    jp_tokens_medipal,
    jp_tokens_mitsubishi_estate,
    jp_tokens_nissin,
    jp_tokens_osaka,
    jp_tokens_seven_i,
    jp_tokens_suntory,
    jp_tokens_toyota,
]

# Join the tokens into strings for each document
documents = [' '.join(tokens) for tokens in tokenized_documents]

# Fit and transform the documents using the TF-IDF vectorizer
tfidf_matrix = vectorizer.fit_transform(documents)

# Convert the TF-IDF matrix to a DataFrame with the feature names as columns
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# List of document names corresponding to each tokenized document
file_names = [
    'us_tokens_amazon',
    'us_tokens_avnet',
    'us_tokens_booking',
    'us_tokens_coca_cola',
    'us_tokens_exxon',
    'us_tokens_gbre',
    'us_tokens_general_motors',
    'us_tokens_microsoft',
    'us_tokens_nrg',
    'us_tokens_starbucks',
    'jp_tokens_air_water',
    'jp_tokens_fuyo',
    'jp_tokens_japan_post',
    'jp_tokens_medipal',
    'jp_tokens_mitsubishi_estate',
    'jp_tokens_nissin',
    'jp_tokens_osaka',
    'jp_tokens_seven_i',
    'jp_tokens_suntory',
    'jp_tokens_toyota',
]

# Insert the document names as the first column in the DataFrame
tfidf_df.insert(0, 'document', file_names)

# Insert the country names based on the document names as the first column in the DataFrame
tfidf_df.insert(0, 'country', tfidf_df['document'].apply(lambda x: 'U.S.' if x.startswith('us') else 'Japan' if x.startswith('jp') else 'Unknown'))

# Display the DataFrame
tfidf_df


Unnamed: 0,country,document,risk,attitude,research,investor,individual,decision,financial,study,market,information,shareholder,employee,fire,secure
0,U.S.,us_tokens_amazon,0.034577,0.0,0.273734,0.530179,0.031698,0.00317,0.322718,0.003326,0.365939,0.582045,0.16424,0.034868,0.191622,0.003851
1,U.S.,us_tokens_avnet,0.048455,0.00621,0.046348,0.081109,0.017382,0.034764,0.330756,0.012157,0.806875,0.417131,0.178018,0.143689,0.0,0.021117
2,U.S.,us_tokens_booking,0.052922,0.0,0.519768,0.064262,0.018713,0.045743,0.370453,0.013089,0.720116,0.166326,0.189007,0.033268,0.0,0.0
3,U.S.,us_tokens_coca_cola,0.178069,0.001241,0.026205,0.071038,0.044458,0.055225,0.779211,0.00583,0.525683,0.167966,0.188173,0.101766,0.002333,0.000422
4,U.S.,us_tokens_exxon,0.199242,0.002738,0.2912,0.249401,0.053646,0.052113,0.28284,0.012865,0.433316,0.121217,0.70083,0.18393,0.007722,0.005586
5,U.S.,us_tokens_gbre,0.104184,0.0,0.040468,0.320301,0.0,0.005683,0.669015,0.0,0.548472,0.081797,0.337521,0.125977,0.0,0.0
6,U.S.,us_tokens_general_motors,0.087187,0.004079,0.202398,0.108984,0.030829,0.082211,0.367431,0.013177,0.789872,0.17645,0.150501,0.34483,0.00767,0.001387
7,U.S.,us_tokens_microsoft,0.038361,0.002599,0.161379,0.133601,0.050931,0.138241,0.21429,0.038167,0.817476,0.427257,0.115082,0.136786,0.012218,0.051268
8,U.S.,us_tokens_nrg,0.196981,0.0,0.059215,0.298493,0.006647,0.019941,0.564358,0.001395,0.599404,0.406048,0.159519,0.006647,0.0,0.0
9,U.S.,us_tokens_starbucks,0.004175,0.0,0.459198,0.484245,0.0,0.0,0.066792,0.0,0.075141,0.734716,0.062618,0.027554,0.0,0.005579


In [124]:
# Show the average TF-IDF score of each risk word by country and the differences (Japan minus U.S.).
grouped_tfidf_df = tfidf_df.groupby('country')[tfidf_df.columns[2:]].mean().reset_index()

japan_row = grouped_tfidf_df.loc[grouped_tfidf_df['country'] == 'Japan'].iloc[0, 1:]
us_row = grouped_tfidf_df.loc[grouped_tfidf_df['country'] == 'U.S.'].iloc[0, 1:]
difference = japan_row - us_row
difference['country'] = 'Difference'
tfidf_country = pd.concat([grouped_tfidf_df, difference.to_frame().T])

tfidf_country

Unnamed: 0,country,risk,attitude,research,investor,individual,decision,financial,study,market,information,shareholder,employee,fire,secure
0,Japan,0.127458,0.000744,0.210835,0.214232,0.049644,0.053924,0.322085,0.028164,0.535001,0.430756,0.096247,0.041699,0.002578,0.016412
1,U.S.,0.094415,0.001687,0.207991,0.234161,0.02543,0.043709,0.396786,0.010001,0.568229,0.328095,0.224551,0.113931,0.022156,0.008921
0,Difference,0.033043,-0.000943,0.002844,-0.01993,0.024214,0.010215,-0.074701,0.018164,-0.033228,0.102661,-0.128304,-0.072232,-0.019579,0.007491
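
To see at a glance which words differ most between the two countries, the differences can be sorted. The values below are copied from the "Difference" row of the table above (Japan minus U.S.). The sorting itself is an added illustration, not part of the original analysis.

```python
# Japan-minus-U.S. differences in mean TF-IDF, copied from the table above.
difference = {
    "risk": 0.033043, "attitude": -0.000943, "research": 0.002844,
    "investor": -0.01993, "individual": 0.024214, "decision": 0.010215,
    "financial": -0.074701, "study": 0.018164, "market": -0.033228,
    "information": 0.102661, "shareholder": -0.128304, "employee": -0.072232,
    "fire": -0.019579, "secure": 0.007491,
}

# Most negative first: terms weighted more heavily in the U.S. articles.
ranked = sorted(difference.items(), key=lambda item: item[1])
for term, gap in ranked:
    print(f"{term:12s} {gap:+.6f}")
```

Negative values indicate words with a higher average score in the U.S. articles, and positive values indicate words with a higher average score in the Japanese articles.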


## Step 6. Data Visualization

In [125]:
import altair as alt

# Melt the tfidf_df DataFrame to long format, keeping 'country' and 'document' as identifier variables
# so that the string-valued 'country' column is not melted in as a term.
tfidf_melted = tfidf_df.melt(id_vars=['country', 'document'], var_name='term', value_name='tfidf')

# Sort the melted DataFrame by 'document', 'tfidf', and 'term' columns
tfidf_melted = tfidf_melted.sort_values(by=['document', 'tfidf', 'term'], ascending=[True, False, True])

# Create a 'rank' column indicating the rank of each term within each document
tfidf_melted['rank'] = tfidf_melted.groupby('document').cumcount() + 1

# List of terms to highlight in red
term_list1 = ['shareholder']

# List of terms to highlight in blue
term_list2 = ['employee']

# Base chart definition with 'rank' on x-axis and 'document' on y-axis
base = alt.Chart(tfidf_melted).encode(
    x='rank:O',
    y='document:N'
)

# Define heatmap using the 'tfidf' values for color encoding
heatmap = base.mark_rect().encode(
    color='tfidf:Q'
)

# Define red circles for terms in term_list1 ('shareholder')
circle_red = base.mark_circle(size=100).encode(
    color=alt.condition(
        alt.FieldOneOfPredicate(field='term', oneOf=term_list1),
        alt.value('red'),
        alt.value('#FFFFFF00')
    )
)

# Define blue circles for terms in term_list2 ('employee')
circle_blue = base.mark_circle(size=100).encode(
    color=alt.condition(
        alt.FieldOneOfPredicate(field='term', oneOf=term_list2),
        alt.value('blue'),
        alt.value('#FFFFFF00')
    )
)

# Add text labels to the heatmap, with color conditional on the 'tfidf' value
text = base.mark_text(baseline='middle').encode(
    text='term:N',
    color=alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)

# Combine heatmap, red circles, blue circles, and text labels into a single chart
(heatmap + circle_red + circle_blue + text).properties(width=1100)
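
The ranking step in the cell above (sort by document, then by descending TF-IDF, then alphabetically by term, and number the rows within each document) can be illustrated without pandas. This is only a sketch of the same logic in plain Python: the toy scores and the helper name `rank_terms` are hypothetical and do not come from the project data.

```python
# Toy TF-IDF scores (hypothetical values, for illustration only).
rows = [
    {"document": "doc_a", "term": "market", "tfidf": 0.50},
    {"document": "doc_a", "term": "employee", "tfidf": 0.20},
    {"document": "doc_a", "term": "audit", "tfidf": 0.20},
    {"document": "doc_b", "term": "shareholder", "tfidf": 0.40},
    {"document": "doc_b", "term": "market", "tfidf": 0.10},
]


def rank_terms(rows):
    """Rank terms within each document by descending score, ties broken by term."""
    # Same ordering as sort_values(by=[...], ascending=[True, False, True]).
    ordered = sorted(rows, key=lambda r: (r["document"], -r["tfidf"], r["term"]))
    counters = {}
    ranked = []
    for r in ordered:
        # Running count per document, like groupby('document').cumcount() + 1.
        counters[r["document"]] = counters.get(r["document"], 0) + 1
        ranked.append({**r, "rank": counters[r["document"]]})
    return ranked


ranked = rank_terms(rows)
for r in ranked:
    print(r["document"], r["rank"], r["term"], r["tfidf"])
```

Tied scores are ordered alphabetically, so the rank of a tied term reflects its spelling rather than any difference in importance.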

In [126]:
import altair as alt

# Melt the grouped_tfidf_df DataFrame to long format, keeping 'country' as the identifier variable
tfidf_melted2 = grouped_tfidf_df.melt(id_vars=['country'], var_name='term', value_name='tfidf')

# Sort the melted DataFrame by 'country', 'tfidf', and 'term' columns
tfidf_melted2 = tfidf_melted2.sort_values(by=['country', 'tfidf', 'term'], ascending=[True, False, True])

# Create a 'rank' column indicating the rank of each term within each country
tfidf_melted2['rank'] = tfidf_melted2.groupby('country').cumcount() + 1

# Base chart definition with 'rank' on x-axis and 'country' on y-axis
base = alt.Chart(tfidf_melted2).encode(
    x='rank:O',
    y='country:N'
)

# Define heatmap using the 'tfidf' values for color encoding
heatmap = base.mark_rect().encode(
    color='tfidf:Q'
)

# Define red circles for terms in term_list1 ('shareholder')
circle_red2 = base.mark_circle(size=100).encode(
    color=alt.condition(
        alt.FieldOneOfPredicate(field='term', oneOf=term_list1),
        alt.value('red'),
        alt.value('#FFFFFF00')
    )
)

# Define blue circles for terms in term_list2 ('employee')
circle_blue2 = base.mark_circle(size=100).encode(
    color=alt.condition(
        alt.FieldOneOfPredicate(field='term', oneOf=term_list2),
        alt.value('blue'),
        alt.value('#FFFFFF00')
    )
)

# Add text labels to the heatmap, with color conditional on the 'tfidf' value
text = base.mark_text(baseline='middle').encode(
    text='term:N',
    color=alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)

# Combine heatmap, red circles, blue circles, and text labels into a single chart
(heatmap + circle_red2 + circle_blue2 + text).properties(width=1100)

## Step 7. Conclusion and Policy Implication

- This project hypothesized that risk-averse Japanese companies prioritize their employees and avoid risk even at the cost of high returns, whereas risk-seeking American companies prioritize shareholders and take risks to achieve high returns. The TF-IDF comparison shows that "shareholder" carries more weight in articles about American companies than in articles about Japanese companies (0.225 vs. 0.096), which is consistent with the hypothesis. The shareholder-related terms "investor" (0.234 vs. 0.214) and "market" (0.568 vs. 0.535) also score somewhat higher for American companies, which lends further, if modest, support.

- However, "employee" also scores higher in articles about American companies than in articles about Japanese companies (0.114 vs. 0.042), which runs against the hypothesis. One possible explanation is that employee-related coverage mixes layoffs with job retention. "Fire," a word that may signal layoffs, scores higher for American companies (0.022 vs. 0.003), while "secure," a word that may signal job retention, scores higher for Japanese companies (0.016 vs. 0.009). This suggests that Japanese companies are more concerned with maintaining employment than American companies are.

- Interestingly, "risk" (0.127 vs. 0.094), "study" (0.028 vs. 0.010), and "information" (0.431 vs. 0.328) score higher for Japanese companies, which may indicate that they prioritize information gathering and analysis to mitigate risk.

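- Overall, these findings are broadly consistent with the hypothesis that employee focus leads to risk aversion while shareholder focus encourages risk-taking, although the higher "employee" score for American companies shows that the evidence is not uniform.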

- As a policy recommendation, instead of creating public funds to inject risk capital into companies, the Japanese government should encourage firms to prioritize shareholders more explicitly. This could involve amending corporate law to state that, in joint-stock companies, shareholder interests take precedence over those of other stakeholders, including employees. Such a change would push Japanese companies to pursue profit-maximizing opportunities and could help end the country's long-standing economic stagnation. While this might hurt job retention at first, economic growth could provide workers with better employment opportunities in the long term.