# Step 1: Install Required Libraries

We install `PyPDF2`, `pdfminer.six`, and `pdfplumber`, three libraries for processing and extracting text from PDF files. The steps shown here use only `pdfplumber`.

In [None]:
!pip install PyPDF2 pdfminer.six
!pip install pdfplumber
import PyPDF2
import pdfplumber



# Step 2: Extract Text from PDF

We specify the path to the PDF file containing the transcript. Using `pdfplumber`, we open the PDF and extract text from each page, concatenating it into a single string. Finally, we print a preview of the first 1000 characters of the extracted text to check the output.
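
Concatenating page text directly assumes that every page yields a string. Depending on the `pdfplumber` version, `page.extract_text()` may return `None` for a page without a text layer, and pages joined without a separator run together. The following is a minimal sketch of a safer join; `join_page_texts` is a hypothetical helper, not part of `pdfplumber`, and the page strings are made up.

```python
from typing import Iterable, Optional

def join_page_texts(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page strings with newlines, treating None as an empty page."""
    return "\n".join(text or "" for text in page_texts)

# Hypothetical per-page results: the second page has no text layer.
pages = ["Page one text", None, "Page three text"]
combined = join_page_texts(pages)
print(repr(combined))
```

With `pdfplumber`, the same helper could be fed a generator such as `(page.extract_text() for page in pdf.pages)`.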


In [None]:
import pdfplumber

pdf_path = '/content/drive/MyDrive/Colab Notebooks/JPENCorpLex/Data/Transcripts/SoftBank/SoftBank_2023_Q3_QA_JP.pdf'

text = ""
with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        text += (page.extract_text() or "") + "\n"  # extract_text() may return None in some versions; "\n" keeps pages separate

print(text[:1000])  # Display a preview of the first 1000 characters

2024年３月期第３四半期決算
Global Conference Call質疑応答録
日時：２０２４年２月８日（木）
登壇者：取締役 専務執行役員、CFO 兼 CISO、財務統括 兼 管理統括 後藤 芳光
常務執行役員、経理統括 君和田 和子
CSusO, IR部長 兼 サステナビリティ部長 上利 陽太郎
CFO, SB Investment Advisers & SB Global Advisers, Navneet Govil
質疑応答
質問者１
Q１：
ここ数四半期、SVFに比べ、SBGからの投資の割合が引き続き高いことについて、何か戦略
的なポイントがあるのでしょうか。この傾向は今後も継続するのでしょうか？また、今後の投
資を考える際に考慮すべきことはありますか？
A１：
（後藤） 今後の投資のスタイルについては、これまでの四半期でも説明してきていますが、
SVF を通じた AI 系企業への投資は引き続き積極的に行っていきます。それと同時に、われ
われの戦略的なパートナーとなり得るような会社に関しては、SBG の自己勘定での投資も積
極的に行っていく予定です。
Q２：
日本の上場企業の情報開示に関する規制のハードルを下げようとする取り組みについて、多
くの報道がなされています。これには、企業を上場させる時間を短縮する提案や、日本の資本
市場活動に関するその他の提案が含まれます。投資家の関心が米国株式に過剰に集中してい
ますが、これらの動向を踏まえて日本市場への投資に関心が高まる可能性や、再度注目され
る可能性はありますか？
A２：
（後藤） 日本のIPO市場の改革についての質問かと思いますが、これまで日本の市場は海外
に比べると上場に比較的時間がかかる傾向がありましたが、そのあたりの改善に努めている
とは思います。その観点からは、日本でのアントレプレナーたちの IPO の件数は、これまでよ
りも増えてくるかもしれません。それは歓迎すべきことだと思います。
A２：
（上利） 併せて、日本への投資を増やす意向はありますかという質問もありました。
1A２：
（後藤） それは別の問題だと思います。われわれは、企業のクオリティをグローバルな視点で
精査し、投資判断を行っています。従って地域に関しては、中国での投資は抑えるスタンスを
明確にしていますが、それ以外は基本

# Step 3: Save Extracted Text to a .txt File

We specify the output path for the text file by replacing the `.pdf` extension in the original file path with `.txt`. Then, we open a new text file in write mode and save the extracted text to this file. Finally, we print a confirmation message indicating where the text has been saved.
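
Swapping the extension with `str.replace` rewrites every occurrence of `.pdf` in the path, including one inside a folder name. As a hedged alternative, the sketch below uses `pathlib` to change only the final suffix. The helper name `txt_path_for` and the example path are hypothetical.

```python
from pathlib import PurePosixPath

def txt_path_for(pdf_path: str) -> str:
    """Return pdf_path with only its final extension swapped for .txt."""
    return str(PurePosixPath(pdf_path).with_suffix(".txt"))

# Hypothetical path whose directory name also contains ".pdf".
example = "/data/reports.pdf_archive/call_Q3.pdf"
replaced = example.replace(".pdf", ".txt")  # rewrites the directory name too
suffixed = txt_path_for(example)            # only the file extension changes
print(replaced)
print(suffixed)
```

For the Drive paths used in this notebook the two approaches give the same result, since `.pdf` appears only once in each path.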

In [None]:
# Specify output path for the .txt file
output_path = pdf_path.replace('.pdf', '.txt')  # Automatically changes the .pdf extension to .txt

# Write the extracted text to a .txt file
with open(output_path, 'w', encoding='utf-8') as text_file:
    text_file.write(text)

print(f"Text extracted and saved to {output_path}")

Text extracted and saved to /content/drive/MyDrive/Colab Notebooks/JPENCorpLex/Data/Transcripts/SoftBank/SoftBank_2023_Q3_QA_JP.txt


# Step 4: Batch Process PDF Files and Save as .txt

This code loops through all subfolders in the specified base directory to find PDF files. For each PDF, it extracts the text using `pdfplumber` and saves the extracted text to a corresponding `.txt` file. A confirmation message is printed for each processed file, indicating where the text has been saved.
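
The file discovery part of this loop can be tried in isolation. The sketch below builds a throwaway directory tree and collects PDF files with `pathlib`, matching the extension case-insensitively (an assumption beyond the original `endswith('.pdf')` check). The function name `find_pdfs` and all file names are made up for illustration.

```python
import tempfile
from pathlib import Path
from typing import List

def find_pdfs(base_dir: Path) -> List[Path]:
    """Return all PDF files under base_dir, matching the extension case-insensitively."""
    return sorted(
        p for p in base_dir.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )

# Build a temporary directory tree to exercise the search.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "CompanyA").mkdir()
    (root / "CompanyA" / "call_Q1.pdf").write_bytes(b"")
    (root / "CompanyA" / "call_Q2.PDF").write_bytes(b"")
    (root / "CompanyA" / "notes.txt").write_text("not a pdf")
    found = [p.name for p in find_pdfs(root)]

print(found)
```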

In [None]:
import os
import pdfplumber

# Define the base directory containing all your PDF files
base_dir = '/content/drive/MyDrive/Colab Notebooks/JPENCorpLex/Data/Transcripts'

# Loop through each subfolder and file in the base directory
for root, dirs, files in os.walk(base_dir):
    for file_name in files:
        if file_name.endswith('.pdf'):  # Process only PDF files
            pdf_path = os.path.join(root, file_name)
            output_path = pdf_path.replace('.pdf', '.txt')  # Create .txt file path

            # Extract text from PDF and save it as a .txt file
            text = ""
            with pdfplumber.open(pdf_path) as pdf:
                for page in pdf.pages:
                    text += (page.extract_text() or "") + "\n"  # extract_text() may return None in some versions; "\n" keeps pages separate

            with open(output_path, 'w', encoding='utf-8') as text_file:
                text_file.write(text)

            print(f"Processed and saved: {output_path}")

Processed and saved: /content/drive/MyDrive/Colab Notebooks/JPENCorpLex/Data/Transcripts/Fast Retailing/FastRetailing_2023_Q1_QA_EN.txt
Processed and saved: /content/drive/MyDrive/Colab Notebooks/JPENCorpLex/Data/Transcripts/Fast Retailing/FastRetailing_2023_Q2_QA_EN.txt
Processed and saved: /content/drive/MyDrive/Colab Notebooks/JPENCorpLex/Data/Transcripts/Fast Retailing/FastRetailing_2023_Q3_QA_EN.txt
Processed and saved: /content/drive/MyDrive/Colab Notebooks/JPENCorpLex/Data/Transcripts/Fast Retailing/FastRetailing_2023_Q4_QA_EN.txt
Processed and saved: /content/drive/MyDrive/Colab Notebooks/JPENCorpLex/Data/Transcripts/Fast Retailing/FastRetailing_2024_Q1_QA_EN.txt
Processed and saved: /content/drive/MyDrive/Colab Notebooks/JPENCorpLex/Data/Transcripts/Fast Retailing/FastRetailing_2024_Q2_QA_EN.txt
Processed and saved: /content/drive/MyDrive/Colab Notebooks/JPENCorpLex/Data/Transcripts/Fast Retailing/FastRetailing_2023_Q1_QA_JP.txt
Processed and saved: /content/drive/MyDrive/Cola

# Step 5: Clean Extracted Text

This code defines a function, `minimal_clean_text`, which cleans the extracted text by removing:
- Dates in Japanese format with a parenthesized weekday (e.g., "2024年2月8日（木）") and in Western format (e.g., "February 8, 2024")
- Speaker labels and question/answer identifiers (e.g., "Speaker 1", "質問者1", "Q1:", "A1:")

It then collapses runs of whitespace and newlines into single spaces. The function is tested on a sample text file, and the first 5,000 characters of the cleaned text are printed for review.
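
To see what these patterns match, the two regular expressions from `minimal_clean_text` can be applied to a short string. The sample sentence below is made up for illustration. Because `\d` matches any Unicode decimal digit in Python 3 string patterns, full-width labels such as `Q１：` are removed as well.

```python
import re

# The two patterns from minimal_clean_text, applied to a made-up sample.
DATE_RE = re.compile(r'\d{4}年\d+月\d+日（.*?）|[A-Za-z]+\s\d{1,2},\s\d{4}')
LABEL_RE = re.compile(r'(Speaker|質問者)\s*\d+|Q\d+[:：]|A\d+[:：]', flags=re.IGNORECASE)

sample = "2024年2月8日（木） 質問者1 Q1: How is demand? A1: Demand is strong. February 8, 2024"
without_dates = DATE_RE.sub('', sample)
without_labels = LABEL_RE.sub('', without_dates)
result = re.sub(r'\s+', ' ', without_labels).strip()
print(result)

# Full-width digits and colons are matched too.
fullwidth_result = LABEL_RE.sub('', 'Q１：質問')
print(fullwidth_result)
```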

In [None]:
import re

def minimal_clean_text(text):
    # Remove only dates and speaker labels (headers are left in place)
    text = re.sub(r'\d{4}年\d+月\d+日（.*?）|[A-Za-z]+\s\d{1,2},\s\d{4}', '', text)  # Dates
    text = re.sub(r'(Speaker|質問者)\s*\d+|Q\d+[:：]|A\d+[:：]', '', text, flags=re.IGNORECASE)  # Speaker labels

    # Remove extra whitespace and newlines without re-encoding
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Test the function on one file
sample_path = '/content/drive/MyDrive/Colab Notebooks/JPENCorpLex/Data/Transcripts/SoftBank/SoftBank_2022_Q3_QA_EN.txt'
with open(sample_path, 'r', encoding='utf-8') as file:
    raw_text = file.read()

# Apply minimal cleaning
cleaned_text = minimal_clean_text(raw_text)
print(cleaned_text[:5000])  # Displaying the first 5000 characters as a sample


SoftBank Group Corp.: FY22Q3 Investor Call Q&A Presenters: Yoshimitsu Goto, Board Director, Corporate Officer, Senior Vice President, CFO & CISO, Head of Finance Unit & Administration Unit Kazuko Kimiwada, Corporate Officer, Senior Vice President, Head of Accounting Unit Navneet Govil, CFO, Member of the Executive Committee, SB Global Advisers Ian Thornton, VP Investor Relations, Arm Limited Thank you very much for taking my questions. I would like to ask you for the fair value of Arm. I am making a question here, looking at page 27 of the Finance section. It was 4.35 trillion yen in September quarter, and it is now 3.75 trillion yen. I believe that this is adjusted down because of strong yen, but even discount it back, the valuation is still 2.3 billion yen lower. If my understanding is correct, I would like to ask you what is the background for that? I would like to ask from Ian, what is your prospects for the short-term period? And give us some color on your prospects for the future

# Step 6: Refine Text Cleaning

This code defines a more refined function, `refined_clean_text`, which further cleans the extracted text by:
- Removing dates in both Japanese and Western formats.
- Eliminating speaker labels, such as "Speaker 1" and "質問者1".
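- Removing text within ASCII parentheses `( )`; full-width parentheses `（ ）` are left untouched.
- Removing footnote markers, meaning an asterisk followed directly by word characters (e.g., `*1`).
- Removing stray asterisks and standalone integers surrounded by whitespace. This also strips legitimate numbers: in the sample output below, "page 27" becomes "page".
- Performing a final cleanup of whitespace and newlines to ensure a tidy output.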

The function is then tested again on the sample text file, with the cleaned text printed for review.
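
The effect of the standalone-number pattern is easy to check on a single sentence. The sketch below applies the pattern from `refined_clean_text` and, for comparison, a narrower variant that removes only stray asterisks. The narrower variant is a hypothetical alternative, not the notebook's method.

```python
import re

sentence = "I am looking at page 27 of the Finance section. It was 4.35 trillion yen."

# Pattern used in refined_clean_text: any whitespace-delimited integer is dropped.
aggressive = re.sub(r'\s+\*\s*|\s+\d+\s+', ' ', sentence)
print(aggressive)

# Hypothetical narrower variant: remove stray asterisks only and keep numbers.
conservative = re.sub(r'\s+\*\s*', ' ', sentence)
print(conservative)
```

Decimal values such as `4.35` survive either way because the digits are not surrounded by whitespace on both sides.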

In [None]:
def refined_clean_text(text):
    # Remove dates in Japanese and Western formats
    text = re.sub(r'\d{4}年\d+月\d+日（.*?）|[A-Za-z]+\s\d{1,2},\s\d{4}', '', text)

    # Remove speaker labels like "Speaker 1" and "質問者1"
    text = re.sub(r'(Speaker|質問者)\s*\d+|Q\d+[:：]|A\d+[:：]', '', text, flags=re.IGNORECASE)

    # Remove text within ASCII parentheses and footnote markers such as "*1"
    text = re.sub(r'\(.*?\)', '', text)
    text = re.sub(r'\*\w+', '', text)  # An asterisk followed directly by word characters

    # Remove stray asterisks and standalone integers (this also drops numbers such as the "27" in "page 27")
    text = re.sub(r'\s+\*\s*|\s+\d+\s+', ' ', text)

    # Final whitespace and newline cleanup
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Test the function again on the sample file
cleaned_text = refined_clean_text(raw_text)
print(cleaned_text[:5000])  # Displaying the first 5000 characters as a sample


SoftBank Group Corp.: FY22Q3 Investor Call Q&A Presenters: Yoshimitsu Goto, Board Director, Corporate Officer, Senior Vice President, CFO & CISO, Head of Finance Unit & Administration Unit Kazuko Kimiwada, Corporate Officer, Senior Vice President, Head of Accounting Unit Navneet Govil, CFO, Member of the Executive Committee, SB Global Advisers Ian Thornton, VP Investor Relations, Arm Limited Thank you very much for taking my questions. I would like to ask you for the fair value of Arm. I am making a question here, looking at page of the Finance section. It was 4.35 trillion yen in September quarter, and it is now 3.75 trillion yen. I believe that this is adjusted down because of strong yen, but even discount it back, the valuation is still 2.3 billion yen lower. If my understanding is correct, I would like to ask you what is the background for that? I would like to ask from Ian, what is your prospects for the short-term period? And give us some color on your prospects for the future. T

# Step 7: Apply Text Cleaning to All Files

This code defines a function, `apply_cleaning_to_all_text_files`, that:
- Iterates through all `.txt` files in the specified directory.
- Reads the raw text from each file.
- Applies the `refined_clean_text` function to clean the text.
- Saves the cleaned text back to the original file.

After defining the function, it is executed to clean all relevant text files in the specified directory.
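
Because the cleaned text overwrites each source file, the raw extraction can only be recovered by re-running the PDF step. One hedged alternative is to write cleaned copies into a separate folder that mirrors the original layout. In the sketch below, `clean` is a stand-in for `refined_clean_text`, and `clean_tree` and the folder names are hypothetical.

```python
import re
import tempfile
from pathlib import Path

def clean(text: str) -> str:
    """Stand-in cleaner (whitespace collapse only); swap in refined_clean_text."""
    return re.sub(r'\s+', ' ', text).strip()

def clean_tree(src_dir: Path, dst_dir: Path) -> int:
    """Write a cleaned copy of every .txt under src_dir into dst_dir, mirroring subfolders."""
    count = 0
    for src in sorted(src_dir.rglob("*.txt")):
        dst = dst_dir / src.relative_to(src_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        dst.write_text(clean(src.read_text(encoding="utf-8")), encoding="utf-8")
        count += 1
    return count

# Exercise the helper on a temporary tree with one made-up transcript.
with tempfile.TemporaryDirectory() as tmp:
    raw, cleaned = Path(tmp) / "raw", Path(tmp) / "cleaned"
    (raw / "CompanyA").mkdir(parents=True)
    (raw / "CompanyA" / "call_EN.txt").write_text("Hello \n\n world ", encoding="utf-8")
    n_files = clean_tree(raw, cleaned)
    original = (raw / "CompanyA" / "call_EN.txt").read_text(encoding="utf-8")
    result = (cleaned / "CompanyA" / "call_EN.txt").read_text(encoding="utf-8")

print(n_files, repr(original), repr(result))
```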

In [None]:
import os

# Path to the folder containing all .txt files
base_dir = '/content/drive/MyDrive/Colab Notebooks/JPENCorpLex/Data/Transcripts'

def apply_cleaning_to_all_text_files():
    for root, dirs, files in os.walk(base_dir):
        for file_name in files:
            if file_name.endswith('.txt'):
                text_path = os.path.join(root, file_name)

                # Read raw text from file
                with open(text_path, 'r', encoding='utf-8') as file:
                    raw_text = file.read()

                # Apply the cleaning function
                cleaned_text = refined_clean_text(raw_text)

                # Save the cleaned text back to the file
                with open(text_path, 'w', encoding='utf-8') as file:
                    file.write(cleaned_text)

                print(f"Cleaned and saved: {text_path}")

# Run the cleaning on all files
apply_cleaning_to_all_text_files()

Cleaned and saved: /content/drive/MyDrive/Colab Notebooks/JPENCorpLex/Data/Transcripts/Fast Retailing/FastRetailing_2023_Q4_QA_JP.txt
Cleaned and saved: /content/drive/MyDrive/Colab Notebooks/JPENCorpLex/Data/Transcripts/Fast Retailing/FastRetailing_2023_Q1_QA_EN.txt
Cleaned and saved: /content/drive/MyDrive/Colab Notebooks/JPENCorpLex/Data/Transcripts/Fast Retailing/FastRetailing_2024_Q1_QA_EN.txt
Cleaned and saved: /content/drive/MyDrive/Colab Notebooks/JPENCorpLex/Data/Transcripts/Fast Retailing/FastRetailing_2023_Q1_QA_JP.txt
Cleaned and saved: /content/drive/MyDrive/Colab Notebooks/JPENCorpLex/Data/Transcripts/Fast Retailing/FastRetailing_2023_Q3_QA_EN.txt
Cleaned and saved: /content/drive/MyDrive/Colab Notebooks/JPENCorpLex/Data/Transcripts/Fast Retailing/FastRetailing_2023_Q3_QA_JP.txt
Cleaned and saved: /content/drive/MyDrive/Colab Notebooks/JPENCorpLex/Data/Transcripts/Fast Retailing/FastRetailing_2023_Q2_QA_EN.txt
Cleaned and saved: /content/drive/MyDrive/Colab Notebooks/JPEN

# Step 8: Install spaCy and Download Language Model

This code installs the spaCy library and downloads the English language model `en_core_web_sm`, which will be used for natural language processing tasks.


In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


# Step 9: Segment English Text into Sentences

This code loads the English spaCy model and defines a function to segment text into sentences. It processes a sample English transcript file and prints the first five sentences extracted from the text.

In [None]:
import spacy

# Load English spaCy model
nlp = spacy.load("en_core_web_sm")

def segment_sentences_en(text):
    # Process text with spaCy and extract sentences
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents]
    return sentences

# Test on a sample English file
with open('/content/drive/MyDrive/Colab Notebooks/JPENCorpLex/Data/Transcripts/SoftBank/SoftBank_2022_Q3_QA_EN.txt', 'r', encoding='utf-8') as file:
    raw_text_en = file.read()

english_sentences = segment_sentences_en(raw_text_en)
print(english_sentences[:5])  # Display first 5 sentences

['SoftBank Group Corp.:', 'FY22Q3 Investor Call Q&A Presenters: Yoshimitsu Goto, Board Director, Corporate Officer, Senior Vice President, CFO & CISO, Head of Finance Unit & Administration Unit Kazuko Kimiwada, Corporate Officer, Senior Vice President, Head of Accounting Unit Navneet Govil, CFO, Member of the Executive Committee, SB Global Advisers Ian Thornton, VP Investor Relations, Arm Limited Thank you very much for taking my questions.', 'I would like to ask you for the fair value of Arm.', 'I am making a question here, looking at page of the Finance section.', 'It was 4.35 trillion yen in September quarter, and it is now 3.75 trillion yen.']


# Step 10: Install MeCab and Unidic

This code installs the MeCab library for Japanese text processing and the `unidic-lite` package, which provides a lightweight dictionary for use with MeCab.

In [None]:
!pip install mecab-python3
!pip install unidic-lite



# Step 11: Segment Japanese Sentences

This code initializes a MeCab tagger and defines a function that segments Japanese text into sentences. The segmentation itself relies only on a regular expression: the tagger is created but never called by the function. The function splits the text at sentence-ending punctuation, reattaches each punctuation mark to its sentence, and strips extra whitespace. Any text after the last punctuation mark is discarded. The function is then tested on a sample Japanese text file.

- **Initialize MeCab**: Sets up the MeCab tagger (unused in this step).
- **Define `segment_sentences_jp` Function**: Splits the input text into sentences at `。`, `？`, and `！`.
- **Test the Function**: Reads a sample Japanese text file and outputs the first five segmented sentences.
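
Pairing every text chunk with the punctuation mark that follows it drops any trailing text that has no closing mark. The sketch below contrasts that behavior with a lookbehind-based split that keeps the remainder. It is an illustrative variant under that assumption, `split_jp_sentences` is a hypothetical name, and the sample text is made up. Splitting on a zero-width match requires Python 3.7 or later.

```python
import re

def split_jp_sentences(text: str) -> list:
    """Split after 。, ？ or ！ and keep trailing text that lacks final punctuation."""
    parts = re.split(r'(?<=[。？！])', text)
    return [p.strip() for p in parts if p.strip()]

sample = "今期は増収です。来期の見通しは？詳細は後日"

# Capture-group split plus zip: the unpunctuated tail is lost.
pieces = re.split(r'(。|？|！)', sample)
zipped = [s + p for s, p in zip(pieces[::2], pieces[1::2])]
print(zipped)

# Lookbehind split: the tail is kept as a final segment.
kept = split_jp_sentences(sample)
print(kept)
```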

In [None]:
import MeCab
import re

# Initialize a MeCab tagger (not used by segment_sentences_jp below, which relies on regex only)
mecab = MeCab.Tagger()

def segment_sentences_jp(text):
    # Split sentences based on Japanese sentence-ending punctuation
    sentences = re.split(r'(。|？|！)', text)  # Split at Japanese sentence-ending punctuation
    # Combine the punctuation back with each sentence
    sentences = [sent + punct for sent, punct in zip(sentences[::2], sentences[1::2])]
    # Remove any extra whitespace
    sentences = [sent.strip() for sent in sentences if sent]
    return sentences

# Test on a sample Japanese text file
sample_jp_path = '/content/drive/MyDrive/Colab Notebooks/JPENCorpLex/Data/Transcripts/SoftBank/SoftBank_2022_Q3_QA_JP.txt'
with open(sample_jp_path, 'r', encoding='utf-8') as file:
    raw_text_jp = file.read()

# Segment sentences
japanese_sentences = segment_sentences_jp(raw_text_jp)
print(japanese_sentences[:5])  # Display first 5 sentences as a sample


['ソフトバンクグループ株式会社 2023年3月期第3四半期 投資家コール質疑応答録 開催日： 登壇者：取締役 専務執行役員、CFO 兼 CISO、財務統括 兼 管理統括 後藤 芳光 常務執行役員、経理統括 君和田 和子 SB Global Advisers, CFO, Member of the Executive Committee, Navneet Govil Arm, VP Investor Relations, Ian Thornton 質疑応答 投資家向け説明会資料財務編の ページでは、Arm の公正価値は 月期 4.35 兆円に 対して今 3.75 兆円になっています。', '円高なので価値が下がっているということだと思いま すが、割り戻しても恐らく2.3 billionぐらい評価が下がっているのではないかと思います。', 'もし理解が正しければ、評価が下がった背景を教えていただけますか。', 'また、Ian さんから全 体感で結構ですので、Armの足元の状況、見通しを解説いただけるとありがたいです。', '（Govil） おっしゃるとおり、バリュエーションの方で少し動きがあり、パフォーマンス以外の 要因、主に為替とキャピタルコストの増加が理由となります。']


# Step 12: Install the Sentence Transformers Library

This command installs the `sentence-transformers` library, which is essential for working with sentence embeddings and implementing models for tasks such as semantic textual similarity and clustering.

In [None]:
!pip install sentence-transformers



# Step 13: Align Japanese and English Sentences Using Sentence Transformers

This code segment does the following:

1. **Imports Necessary Libraries**:
   - `os` for file and directory manipulation.
   - `sentence_transformers` to load the pre-trained model.
   - `re` for regular expressions.
   - `csv` for handling CSV file writing.

2. **Loads the Sentence Transformer Model**:
   - Initializes the `paraphrase-multilingual-MiniLM-L12-v2` model for encoding sentences.

3. **Defines the Path to Transcript Files**:
   - Sets the directory where the transcript files are located.

4. **Initializes Variables**:
   - `all_aligned_pairs` to store pairs of aligned sentences and their similarity scores.
   - `similarity_threshold` to filter out low-confidence matches.

5. **Defines a Function for Sentence Segmentation**:
   - Uses spaCy for English sentences and a regular expression on sentence-ending punctuation for Japanese sentences.

6. **Processes Transcript Files**:
   - Loops through the directory structure to find Japanese and corresponding English files.
   - Reads and segments the text from both files.
   - Encodes the segmented sentences using the sentence transformer model.

7. **Calculates Cosine Similarities**:
   - For each Japanese sentence, it finds the best matching English sentence based on cosine similarity.
   - Only adds pairs with a similarity score above the threshold to the list.

8. **Saves Aligned Sentences**:
   - Writes the aligned sentence pairs and their similarity scores to a TSV file named `filtered_aligned_corpus.tsv`.

- **Output**: Displays the total number of aligned pairs collected above the threshold and confirms saving the aligned corpus to a file.
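
The matching logic in steps 7 and 8 can be illustrated without the model. The sketch below uses made-up two-dimensional vectors in place of real sentence embeddings and plain Python in place of `util.cos_sim`. The function names `cosine` and `align` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def align(src_vecs, tgt_vecs, threshold):
    """For each source vector, keep (src_idx, best_tgt_idx, score) if score >= threshold."""
    pairs = []
    for i, u in enumerate(src_vecs):
        scores = [cosine(u, v) for v in tgt_vecs]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs

# Toy 2-D "embeddings" (made up); real ones would come from model.encode.
jp_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
en_vecs = [[0.0, 2.0], [3.0, 0.1], [-1.0, 0.0]]
pairs = align(jp_vecs, en_vecs, threshold=0.9)
print([(i, j) for i, j, _ in pairs])
```

The matching runs in one direction only, so several Japanese sentences can select the same English sentence, and sentence order is not taken into account.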

In [None]:
import os
from sentence_transformers import SentenceTransformer, util
import re
import csv

# Load the sentence transformer model
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Define the path to the folder containing all transcript files
transcripts_dir = '/content/drive/MyDrive/Colab Notebooks/JPENCorpLex/Data/Transcripts'

# Initialize a list to store all aligned pairs with scores
all_aligned_pairs = []
similarity_threshold = 0.7  # Set your threshold here

# Function to segment text based on language
def segment_sentences(text, lang='jp'):
    if lang == 'en':
        # English segmentation using the spaCy pipeline `nlp` loaded in Step 9
        doc = nlp(text)
        return [sent.text.strip() for sent in doc.sents]
    elif lang == 'jp':
        # Japanese segmentation using a regex on sentence-ending punctuation
        sentences = re.split(r'(。|？|！)', text)
        sentences = [sent + punct for sent, punct in zip(sentences[::2], sentences[1::2])]
        return [sent.strip() for sent in sentences if sent]

# Recursively search through subdirectories
for root, dirs, files in os.walk(transcripts_dir):
    for file_name in files:
        if file_name.endswith('_JP.txt'):
            # Derive the English file name from the Japanese file name
            en_file_name = file_name.replace('_JP', '_EN')
            jp_path = os.path.join(root, file_name)
            en_path = os.path.join(root, en_file_name)

            # Check if the corresponding English file exists
            if os.path.exists(en_path):
                print(f"Processing file pair: {file_name} and {en_file_name}")  # Debugging line

                # Read and segment the Japanese text
                with open(jp_path, 'r', encoding='utf-8') as f:
                    jp_text = f.read()
                jp_sentences = segment_sentences(jp_text, lang='jp')
                print(f"Japanese sentences found: {len(jp_sentences)}")  # Debugging line

                # Read and segment the English text
                with open(en_path, 'r', encoding='utf-8') as f:
                    en_text = f.read()
                en_sentences = segment_sentences(en_text, lang='en')
                print(f"English sentences found: {len(en_sentences)}")  # Debugging line

                # Encode and align sentences for this file pair
                if jp_sentences and en_sentences:
                    jp_embeddings = model.encode(jp_sentences, convert_to_tensor=True)
                    en_embeddings = model.encode(en_sentences, convert_to_tensor=True)

                    for i, jp_embed in enumerate(jp_embeddings):
                        # Calculate similarity scores
                        cos_similarities = util.cos_sim(jp_embed, en_embeddings)[0]
                        best_match_idx = cos_similarities.argmax().item()
                        best_similarity_score = cos_similarities[best_match_idx].item()

                        # Add only high-confidence pairs to the final output
                        if best_similarity_score >= similarity_threshold:
                            all_aligned_pairs.append((
                                jp_sentences[i],  # Japanese sentence
                                en_sentences[best_match_idx],  # Best matching English sentence
                                best_similarity_score  # Similarity score
                            ))

print(f"Total aligned pairs collected above threshold: {len(all_aligned_pairs)}")  # Final debugging line

# Save the full aligned dataset with similarity scores to a TSV file
if all_aligned_pairs:
    with open('filtered_aligned_corpus.tsv', 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f, delimiter='\t')
        writer.writerow(['Japanese', 'English', 'SimilarityScore'])
        for jp, en, score in all_aligned_pairs:
            writer.writerow([jp, en, score])
    print("Filtered aligned corpus saved as 'filtered_aligned_corpus.tsv'")
else:
    print("No aligned pairs found above the threshold. Please check the threshold or directory structure.")




Processing file pair: FastRetailing_2023_Q4_QA_JP.txt and FastRetailing_2023_Q4_QA_EN.txt
Japanese sentences found: 87
English sentences found: 95
Processing file pair: FastRetailing_2023_Q1_QA_JP.txt and FastRetailing_2023_Q1_QA_EN.txt
Japanese sentences found: 42
English sentences found: 50
Processing file pair: FastRetailing_2023_Q3_QA_JP.txt and FastRetailing_2023_Q3_QA_EN.txt
Japanese sentences found: 50
English sentences found: 63
Processing file pair: FastRetailing_2023_Q2_QA_JP.txt and FastRetailing_2023_Q2_QA_EN.txt
Japanese sentences found: 62
English sentences found: 80
Processing file pair: FastRetailing_2024_Q1_QA_JP.txt and FastRetailing_2024_Q1_QA_EN.txt
Japanese sentences found: 76
English sentences found: 97
Processing file pair: FastRetailing_2024_Q2_QA_JP.txt and FastRetailing_2024_Q2_QA_EN.txt
Japanese sentences found: 85
English sentences found: 95
Processing file pair: Nissan_2022_Q1_QA_JP.txt and Nissan_2022_Q1_QA_EN.txt
Japanese sentences found: 66
English sente

# Step 14: Import Necessary Libraries and Install Required Package

- **pandas**: For data manipulation and analysis using DataFrames.
- **train_test_split, KFold**: Functions from `sklearn.model_selection` for splitting datasets and performing K-Fold cross-validation.
- **accuracy_score, precision_score, recall_score, f1_score**: Metrics from `sklearn.metrics` to evaluate model performance.
- **SentenceTransformer, InputExample, losses**: From the `sentence_transformers` library, for loading pre-trained models, wrapping training pairs, and defining training losses.
- **DataLoader**: From `torch.utils.data`, for batching training examples.
- **datasets**: The Hugging Face `datasets` library is installed with `pip`, and its `Dataset` class is imported.


In [None]:
!pip install datasets

import pandas as pd
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
from datasets import Dataset



# Step 15: Load Filtered Aligned Corpus Data

- **pd.read_csv**: Reads the filtered aligned corpus data from a TSV file.
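- **File Path**: The path points to a location in Google Drive. Step 13 writes `filtered_aligned_corpus.tsv` to the current working directory, so the file must be copied to this Drive folder (or the path adjusted) before this cell runs.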
- **sep='\t'**: Indicates that the file is tab-separated, which is typical for TSV (Tab-Separated Values) files.

This code will load the data into a pandas DataFrame (`df`), which can then be used for further analysis or model training.


In [None]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/JPENCorpLex/Data/filtered_aligned_corpus.tsv', sep='\t')

# Step 16: Train-Test Split

- **train_test_split**: This function splits the original DataFrame (`df`) into training and testing subsets.
- **test_size=0.2**: Specifies that 20% of the data should be reserved for testing, while the remaining 80% will be used for training.
- **random_state=42**: Ensures reproducibility of the split. Using the same random state will yield the same train-test split each time the code is run.

### Output:
- The first five rows of the training DataFrame (`train_df`) are printed for inspection.
- The first five rows of the testing DataFrame (`test_df`) are also printed to review the data being used for evaluation.
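
To illustrate what a seeded 80/20 split means, here is a standard-library sketch. It is not scikit-learn's implementation and will not reproduce the same row assignment as `train_test_split`. The function name `seeded_split` is hypothetical.

```python
import random

def seeded_split(items, test_size=0.2, seed=42):
    """Shuffle a copy of items with a fixed seed; the first test_size share becomes the test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))
train_a, test_a = seeded_split(rows)
train_b, test_b = seeded_split(rows)

# Same seed, same split; every row lands in exactly one subset.
print(len(train_a), len(test_a), train_a == train_b)
```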


In [None]:
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Print the first five rows of the training data
print("Training Data (First 5 Rows):")
print(train_df.head())
print("\n")  # Adding a newline for better readability

# Print the first five rows of the testing data
print("Testing Data (First 5 Rows):")
print(test_df.head())


Training Data (First 5 Rows):
                                               Japanese  \
3741  併せて、その実現益を見合いに、 含み損を抱えた債券を売却・整理するとともに、「満期保有目的債...   
3704  先日発表した KS による 野村證券の在タイ子会社の買収は、タイにおける証券市場の成長性を取...   
2346                   梅津 [A]：全体的には収益性をしっかり高めるべく動いています。   
3633  Q. 法人との資本ビジネスに拡大余地があると考える背景と、相続を起点とする不動産ビジ ネスが...   
4394   前年同期に発売した『ゴッド・オブ・ウォー ラグナロク』に続く2年連続での⼤ヒット創出とな...   

                                                English  SimilarityScore  
3741  Furthermore, we have utilized our held-to-matu...         0.756145  
3704  Moreover, the acquiree ranks second in Thailan...         0.704928  
2346  [A]: Overall looking, yes, we are taking actio...         0.722735  
3633  Furthermore, please discuss the factors contri...         0.765240  
4394  The game is our second blockbuster hit in two ...         0.712616  


Testing Data (First 5 Rows):
                                               Japanese  \
911   現在の NAV ディスカウントを拝見すると、今の株価で考えれ ば、ソフトバンクグループ株式は...   
1183            

# Step 17: Initialize and Train the Model

1. **Import Necessary Libraries**:
   - Import the `SentenceTransformer` and related classes from the `sentence_transformers` library, as well as the `DataLoader` from PyTorch.

2. **Model Initialization**:
   - **SentenceTransformer**: The model is initialized from the pre-trained `paraphrase-multilingual-MiniLM-L12-v2` checkpoint, which is suitable for multilingual paraphrase tasks.

3. **Prepare Training Data**:
   - The training examples are prepared using the `InputExample` format, where:
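     - Each example consists of a Japanese sentence and its aligned English sentence.
     - The label is set to `1.0` if the `SimilarityScore` exceeds a defined threshold (0.70); otherwise, it is set to `0.0`. Because Step 13 already kept only pairs scoring at least 0.70, nearly every example receives the label `1.0`, so the model sees almost no negative pairs.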

4. **Create DataLoader**:
   - A `DataLoader` is created for the training examples, allowing for shuffling of the data and defining a batch size of 16 for training.

5. **Define Loss Function**:
   - The loss function is defined as `CosineSimilarityLoss`, which will be used to optimize the model during training.

6. **Train the Model**:
   - The model is trained using the `fit` method, which takes:
     - **train_objectives**: A list containing the `train_dataloader` and the defined `train_loss`.
     - **epochs**: The number of training epochs (set to 2 for this run).
     - **warmup_steps**: The number of initial training steps over which the learning rate ramps up to its full value (set to 100 for this run).
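
Before training, it can help to check how the threshold rule distributes the labels. The sketch below applies the same rule to the five similarity scores shown in the training-data preview of Step 16. The helper name `to_label` is hypothetical.

```python
def to_label(score: float, threshold: float = 0.7) -> float:
    """Binary label used in the training cell: 1.0 above the threshold, else 0.0."""
    return 1.0 if score > threshold else 0.0

# Scores taken from the training-data preview printed in Step 16.
scores = [0.756145, 0.704928, 0.722735, 0.765240, 0.712616]
labels = [to_label(s) for s in scores]
positive_share = sum(labels) / len(labels)
print(labels, positive_share)
```

All five examples come out positive. If negative examples are wanted, one possible approach (an assumption, not something this notebook does) is to pair each Japanese sentence with a randomly chosen non-matching English sentence and label that pair `0.0`.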


In [None]:
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
from datasets import Dataset

# Initialize the model
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Prepare training data in InputExample format with thresholding
threshold = 0.7  # Define your threshold here
train_examples = [
    InputExample(texts=[row['Japanese'], row['English']],
                 label=1.0 if row['SimilarityScore'] > threshold else 0.0)
    for index, row in train_df.iterrows()
]

# Create a DataLoader for the training examples
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Define the loss function
train_loss = losses.CosineSimilarityLoss(model)

# Train the model
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=2,  # Set this to your desired number of epochs
          warmup_steps=100)  # Set warmup steps accordingly

# Optionally save the trained model
model.save("trained_paraphrase_model")  # Save your model to disk




# Step 18: Evaluate the Model

In this step, we will evaluate the performance of the trained model by calculating the cosine similarity between the Japanese and English sentences in the test dataset.

1. **Extract Test Sentences**: We create a list of tuples, where each tuple contains a Japanese sentence and its corresponding English sentence.

2. **Get Embeddings**: Using the trained `SentenceTransformer` model, we encode both the Japanese and English sentences into embeddings. These embeddings are vector representations that capture the meaning of the sentences.

3. **Compute Cosine Similarities**: We calculate the cosine similarity between every Japanese embedding and every English embedding, which yields a matrix with one row per Japanese sentence and one column per English sentence.

4. **Output Results**: Finally, we print each sentence pair together with the maximum value in its row. This is the score of the best-matching English sentence in the test set, which equals the pair's own score only when the paired sentence is also the best match.
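
In a cosine-similarity matrix, the score of pair `i` sits on the diagonal at position `[i][i]`, while the row maximum is the best match among all English sentences. The sketch below shows the difference with made-up vectors in plain Python, using a hypothetical `cosine` helper instead of `util.cos_sim`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Made-up embeddings: en[1] happens to be closer to jp[0] than its own pair en[0].
jp = [[1.0, 0.0], [0.6, 0.8]]
en = [[0.8, 0.6], [1.0, 0.0]]

matrix = [[cosine(u, v) for v in en] for u in jp]
paired = [matrix[i][i] for i in range(len(jp))]  # score of each (jp[i], en[i]) pair
row_max = [max(row) for row in matrix]           # best match over all English vectors

print([round(x, 2) for x in paired])
print([round(x, 2) for x in row_max])
```

With the tensor returned by `util.cos_sim`, the paired scores are likewise the diagonal of the matrix.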


In [None]:
# Extract Japanese and English sentences from the test DataFrame
test_sentences = list(zip(test_df['Japanese'], test_df['English']))

# Get embeddings for the test sentences using the trained model
jp_embeddings = model.encode([sent[0] for sent in test_sentences], convert_to_tensor=True)
en_embeddings = model.encode([sent[1] for sent in test_sentences], convert_to_tensor=True)

# Compute cosine similarities between Japanese and English embeddings
from sentence_transformers import util

cosine_similarities = util.cos_sim(jp_embeddings, en_embeddings)

# Print out the cosine similarities for analysis
print("Cosine Similarities:")
for i, row in enumerate(cosine_similarities):
    print(f"Japanese Sentence {i + 1}: {test_sentences[i][0]}")
    print(f"English Sentence {i + 1}: {test_sentences[i][1]}")
    print(f"Similarity Score: {row.max().item()}")  # Highest similarity to any English test sentence; row[i] would give the paired score
    print("-----")


Cosine Similarities:
Japanese Sentence 1: 現在の NAV ディスカウントを拝見すると、今の株価で考えれ ば、ソフトバンクグループ株式は100％近いアップサイドがあると言えると思います。
English Sentence 1: If I look at your current discount to NAV, the SoftBank Group shares could offer close to 100% upside at the current share price, and this is based on a NAV which is depressed, so arguably the upside is potentially closer to 150% if the NAV goes back to more fair value territory.
Similarity Score: 0.9959877729415894
-----
Japanese Sentence 2: １四半期あたり$１Bというペースは今後も続くと考えていいのでしょうか？
English Sentence 2: Is it right to think that that pace of about a billion per quarter will continue?
Similarity Score: 0.9937347173690796
-----
Japanese Sentence 3: 実際、当社経由のNISA買付額は、先週までの僅か3週 間で、2023年の年間買付額の3 分の1以上に達しています。
English Sentence 3: In just the three weeks to last week, we saw NISA sales at over one third of annual sales for 2023.
Similarity Score: 0.9935804605484009
-----
Japanese Sentence 4: まずは、そうした投資を優先し、成長の道筋をし っかりつくることに注力していきます。
English Sentence 4: We will priorit