# Plot Collocations Pair Tool

**Purpose:** This notebook plots the log ratio of frequencies of words that immediately follow each of two user-specified words (e.g., 'he' vs 'she'). It helps visualize which words are more likely to follow one target word than the other in a given text corpus.
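
As a rough illustration of the computation described above, the sketch below counts the words that directly follow each target word and converts the counts to a log2 ratio. It is a simplified stand-in, not the notebook's Step 6 code: the function name `log_ratio_table`, the `min_total` parameter, and the whitespace tokenization are assumptions made for brevity (the notebook itself tokenizes with NLTK).

```python
import numpy as np
import pandas as pd

def log_ratio_table(texts, word1, word2, min_total=0):
    """Hypothetical helper: log2 ratio of words following word2 vs word1."""
    rows = []
    for text in texts:
        tokens = text.lower().split()  # assumption: whitespace split instead of NLTK
        for first, second in zip(tokens, tokens[1:]):
            if first in (word1, word2):
                rows.append((first, second))
    pairs = pd.DataFrame(rows, columns=["w1", "w2"])

    # Rows are following words, columns are the two target words.
    counts = pd.crosstab(pairs["w2"], pairs["w1"])
    counts = counts.reindex(columns=[word1, word2], fill_value=0)
    counts = counts[counts.sum(axis=1) > min_total].copy()

    # Add-one smoothing, then each column's share of its own total.
    smoothed = counts[[word1, word2]] + 1
    shares = smoothed / smoothed.sum(axis=0)
    counts["logratio"] = np.log2(shares[word2] / shares[word1])
    return counts.sort_values("logratio", ascending=False)

texts = ["he run fast and she run fast", "she say yes", "he say no and he run"]
table = log_ratio_table(texts, "he", "she")
print(table)
```

Positive values lean toward the second word and negative values toward the first. The add-one smoothing keeps the ratio finite when a word follows only one of the two targets.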

**Inputs:**
* A CSV file with a column containing lemmatized text.
* The name of the column with the lemmatized text.
* Two words to compare (e.g., 'he', 'she').
* Desired name for the output plot image and data CSV.
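
For reference, a minimal input file could look like the one built below. This is only a sketch: the `doc_id` column and the two example sentences are made up, and `text_lemmatized` is simply the default column name used in the Step 5 form.

```python
import pandas as pd

# Hypothetical example data; real input would come from your own lemmatization step.
example = pd.DataFrame({
    "doc_id": [1, 2],
    "text_lemmatized": ["he walk to the store", "she read a book"],
})

# Render as CSV text to show the expected layout (one header row, one row per document).
csv_text = example.to_csv(index=False)
print(csv_text)
```

Extra columns such as `doc_id` are ignored by the analysis; only the column named in Step 5 is read for text.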

**Outputs:**
* A plot showing words more associated with one or the other target word.
* A CSV file containing the data used for the plot (bigram counts, ratios).
* Files are saved to Google Drive and offered for download.

**Instructions:**
* Run the cells in order, from Step 1 to Step 7.
* Authorize Google Drive access when prompted (Step 3).
* Upload your CSV when prompted (Step 4).
* Configure parameters in the "Step 5: Configure Parameters" form.

## Using This Notebook & Viewing Code

**Cell Visibility:** This Colab notebook uses `#@title` directives for its main operational cells (like "Step 1: Install Libraries", "Step 2: Import Libraries", etc.). This allows you to collapse or expand the code in these cells.
*   To **hide the code** for a cell, click the small arrow next to its title or in the cell toolbar.
*   To **show the code**, click the arrow again.

This lets you focus on the instructions and outputs rather than the underlying code, if you prefer.

**Simplified Inputs:** For cells where you need to provide input (like "Step 5: Configure Parameters"), this notebook uses Colab's "form" feature. You can enter your parameters directly in the input fields provided, and you don't need to modify the code in that cell unless you want to change default behaviors.
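
A form cell of this kind is ordinary Python with special comments that Colab renders as input fields. The sketch below is a hypothetical cell, not one from this notebook; the variable names are invented for illustration. Outside Colab the `#@` comments are ignored and the assignments run as normal code.

```python
#@title Example form cell (hypothetical)
#@markdown Enter a column name and a minimum count.
example_column = "text_lemmatized"  #@param {type:"string"}
example_min_count = 10  #@param {type:"integer"}
example_lowercase = True  #@param {type:"boolean"}

# Editing a form field rewrites the value on the matching line above.
print(example_column, example_min_count, example_lowercase)
```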

---


In [None]:
#@title Step 1: Install libraries required for the notebook.
# This cell installs the libraries the notebook needs.
# pip skips any package that is already installed, so re-running the cell is safe.
print("Installing pandas, nltk, matplotlib, and seaborn...")
!pip install pandas nltk matplotlib seaborn -q
print("Installation complete.")

In [None]:
#@title Step 2: Import the libraries needed for the script.
import pandas as pd
from google.colab import files
import nltk
from nltk.util import ngrams  # Builds n-grams (here, bigrams) from a token list
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import shutil

print("Starting library imports and NLTK resource checks...")

# NLTK Resource Download Section
nltk_resources = [
    ('tokenizers/punkt', 'punkt'),
    ('tokenizers/punkt_tab/english/', 'punkt_tab') # Check for the specific English path directory
]

for resource_path, resource_id in nltk_resources:
    try:
        nltk.data.find(resource_path)
        print(f"NLTK resource '{resource_id}' (path: {resource_path}) already available.")
    except LookupError:
        print(f"NLTK resource '{resource_id}' (path: {resource_path}) not found. Attempting download...")
        try:
            nltk.download(resource_id, quiet=False) # Download with quiet=False for more verbose output
            print(f"Successfully downloaded NLTK resource '{resource_id}'.")
            # Optionally, re-verify after download
            nltk.data.find(resource_path)
            print(f"NLTK resource '{resource_id}' now available after download.")
        except Exception as e:
            print(f"Error downloading NLTK resource '{resource_id}': {e}")
            print(f"Please try manually downloading '{resource_id}' using 'nltk.download(\"{resource_id}\")' in a new cell if issues persist.")
    except Exception as e:
        print(f"An unexpected error occurred while checking for NLTK resource '{resource_id}': {e}")

print("Finished NLTK resource checks.")
print("Libraries imported.")

# Set a plot style
plt.style.use('seaborn-v0_8-whitegrid')

In [None]:
#@title Step 3: Mount your Google Drive to Colab.
from google.colab import drive
import os

try:
    drive.mount('/content/drive', force_remount=False) # Set force_remount=True if you always want to re-authenticate
    print("Google Drive mounted successfully.")

    # Define the output directory in Google Drive
    # You can change 'Colab_Data_Collocations_Pair' to any folder name you prefer.
    output_directory = '/content/drive/MyDrive/Colab_Data_Collocations_Pair'

    if not os.path.exists(output_directory):
        os.makedirs(output_directory)
        print(f"Output directory '{output_directory}' created successfully in your Google Drive.")
    else:
        print(f"Output directory '{output_directory}' already exists in your Google Drive.")

except Exception as e:
    print(f"An error occurred during Google Drive mounting or directory creation: {e}")
    output_directory = None # Ensure output_directory is None if mounting fails
    print("Please check your Google Drive permissions and authentication.")

In [None]:
#@title Step 4: Upload Your CSV File.
# Import necessary library for file uploading in Colab
from google.colab import files

# Prompt user to upload a CSV file
print('Please upload your CSV file:')
uploaded = files.upload()

# Check if a file was uploaded
if uploaded:
    # Get the name of the uploaded file
    input_csv_name = list(uploaded.keys())[0]
    print(f'Successfully uploaded "{input_csv_name}".')
else:
    input_csv_name = None
    print('No file uploaded. Please upload a file to proceed.')

In [None]:
#@title Step 5: Configure Parameters.
#@markdown ### Input Data Configuration
#@markdown Enter the name of the column in your CSV that contains the lemmatized text data.
text_column_name = "text_lemmatized"  #@param {type:"string"}

#@markdown --- 
#@markdown ### Target Word Configuration
#@markdown Specify the two words you want to compare collocations for.
word1_input = "he"  #@param {type:"string"}
word2_input = "she"  #@param {type:"string"}

#@markdown --- 
#@markdown ### Output File Configuration
#@markdown Define the names for your output files. The plot will be a PNG image and the data will be a CSV file.
output_plot_filename = "collocation_plot.png"  #@param {type:"string"}
output_csv_filename = "bigram_ratios.csv"  #@param {type:"string"}

# Basic validation for output file names
if not output_plot_filename.lower().endswith('.png'):
    output_plot_filename += '.png'
    print(f"Plot filename automatically appended with .png: {output_plot_filename}")

if not output_csv_filename.lower().endswith('.csv'):
    output_csv_filename += '.csv'
    print(f"CSV filename automatically appended with .csv: {output_csv_filename}")

# Print the configured parameters to confirm choices
print("--- Configuration Summary ---")
print(f"Uploaded CSV file name: {input_csv_name if 'input_csv_name' in locals() else 'Not yet uploaded'}")
print(f"Lemmatized text column: '{text_column_name}'")
print(f"Word 1 for comparison: '{word1_input}'")
print(f"Word 2 for comparison: '{word2_input}'")
print(f"Output plot file name: '{output_plot_filename}'")
print(f"Output CSV data file name: '{output_csv_filename}'")
print("---------------------------")

In [None]:
#@title Step 6: Process Data, Calculate Ratios, and Generate Plot.
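print("Starting Step 6: Processing Data, Calculating Ratios...")

# Reset results from any previous run so a failed run cannot plot or save stale data
word_ratios_df = pd.DataFrame()
temp_plot_path = None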

# Ensure input_csv_name is available from Step 4 and parameters from Step 5
if 'input_csv_name' not in locals() or input_csv_name is None:
    print("Error: CSV file name is not defined. Please upload a file in Step 4 and re-run this cell.")
elif 'text_column_name' not in locals() or 'word1_input' not in locals() or 'word2_input' not in locals():
    print("Error: Configuration parameters (text_column_name, word1_input, word2_input) not defined. Please run Step 5 and re-run this cell.")
else:
    try:
        # Load the uploaded CSV file
        df = pd.read_csv(input_csv_name)
        print(f"Successfully loaded '{input_csv_name}'.")

        # Check if the specified text column exists
        if text_column_name not in df.columns:
            print(f"Error: Column '{text_column_name}' not found in the CSV. Available columns are: {df.columns.tolist()}. Please check the column name in Step 5.")
        else:
            print(f"Using column '{text_column_name}' for text processing.")
            # Handle potential empty strings or NaN values in the text column
            df[text_column_name] = df[text_column_name].fillna('').astype(str)

            # Prepare target words: trim stray whitespace from the form fields and lowercase
            word1_processed = word1_input.strip().lower()
            word2_processed = word2_input.strip().lower()
            pronouns = [word1_processed, word2_processed]
            print(f"Target words for comparison (lowercase): {pronouns}")

            # --- Process Text and Generate Bigrams ---
            all_bigrams = []
            print(f"Processing text from '{text_column_name}' to generate bigrams...")
            for text in df[text_column_name]:
                # Ensure NLTK tokenizers are available (checked in Step 2, but good practice for standalone execution)
                try:
                    tokens = nltk.word_tokenize(text.lower()) # Tokenize and ensure text is lowercase
                except Exception as e:
                    print(f"NLTK word_tokenize error: {e}. Ensure NLTK resources from Step 2 are downloaded.")
                    tokens = [] # Avoid further errors
                current_bigrams = list(ngrams(tokens, 2))  # Adjacent token pairs (bigrams)
                all_bigrams.extend(current_bigrams)
            
            if not all_bigrams:
                print("No bigrams were generated. This could be due to empty text in the specified column or issues with tokenization.")
                word_ratios_df = pd.DataFrame() # Ensure word_ratios_df exists
            else:
                print(f"Generated a total of {len(all_bigrams)} bigrams.")

                # --- Count and Filter Bigrams ---
                bigram_df = pd.DataFrame(all_bigrams, columns=['w1', 'w2'])
                bigram_df['w1'] = bigram_df['w1'].str.lower()
                bigram_df['w2'] = bigram_df['w2'].str.lower()

                # Filter for bigrams starting with one of the target pronouns
                filtered_bigrams = bigram_df[bigram_df['w1'].isin(pronouns)]

                if filtered_bigrams.empty:
                    print(f"No bigrams found starting with the target words: {pronouns}. Cannot proceed with ratio calculation.")
                    word_ratios_df = pd.DataFrame() # Ensure word_ratios_df exists
                else:
                    print(f"Found {len(filtered_bigrams)} bigrams starting with one of the target words.")
                    bigram_counts = filtered_bigrams.groupby(['w1', 'w2']).size().reset_index(name='n')

                    # --- Calculate Word Ratios ---
                    word_counts_pivot_prep = bigram_counts.rename(columns={'n': 'total'})

                    # Filter out words (w2) where the sum of total counts (for that w2 across both pronouns) is not greater than 10
                    w2_group_totals = word_counts_pivot_prep.groupby('w2')['total'].transform('sum')
                    word_counts_filtered = word_counts_pivot_prep[w2_group_totals > 10]

                    if word_counts_filtered.empty:
                        print("No words (w2) met the frequency threshold (>10 total occurrences with target words). Cannot calculate ratios.")
                        word_ratios_df = pd.DataFrame() # Ensure word_ratios_df exists
                    else:
                        print(f"Proceeding with {word_counts_filtered['w2'].nunique()} unique words (w2) that meet the frequency threshold.")
                        word_ratios_df = word_counts_filtered.pivot_table(
                            index='w2',
                            columns='w1',
                            values='total',
                            fill_value=0
                        )

                        if word1_processed not in word_ratios_df.columns:
                            word_ratios_df[word1_processed] = 0
                        if word2_processed not in word_ratios_df.columns:
                            word_ratios_df[word2_processed] = 0
                        
                        word_ratios_df[word1_processed] = word_ratios_df[word1_processed] + 1
                        word_ratios_df[word2_processed] = word_ratios_df[word2_processed] + 1

                        col_sum1 = word_ratios_df[word1_processed].sum()
                        col_sum2 = word_ratios_df[word2_processed].sum()
                        
                        word_ratios_df[f'{word1_processed}_norm'] = word_ratios_df[word1_processed] / col_sum1 if col_sum1 > 0 else 0
                        word_ratios_df[f'{word2_processed}_norm'] = word_ratios_df[word2_processed] / col_sum2 if col_sum2 > 0 else 0

                        epsilon = 1e-9 
                        word_ratios_df['logratio'] = np.log2(
                            (word_ratios_df[f'{word2_processed}_norm'] + epsilon) / 
                            (word_ratios_df[f'{word1_processed}_norm'] + epsilon)
                        )

                        word_ratios_df = word_ratios_df.sort_values(by='logratio', ascending=False)

                        print("--- Word Ratios Dataframe (Top 5) ---")
                        print(word_ratios_df[['logratio', word1_processed, word2_processed, f'{word1_processed}_norm', f'{word2_processed}_norm']].head())
            
    except FileNotFoundError:
        print(f"Error: The file '{input_csv_name}' was not found. Please ensure it was uploaded correctly in Step 4 and re-run.")
        word_ratios_df = pd.DataFrame() 
    except Exception as e:
        print(f"An unexpected error occurred during data processing in Step 6: {e}")
        import traceback
        traceback.print_exc()
        word_ratios_df = pd.DataFrame()

print("Step 6 data processing finished.")

# --- Generate the Plot (appended to Step 6) ---
if 'word_ratios_df' in locals() and not word_ratios_df.empty:
    print("Preparing data for plotting...")
    plot_df = word_ratios_df.reset_index() 
    plot_df['abslogratio'] = np.abs(plot_df['logratio'])

    N = 15
    group_more_word1 = plot_df[plot_df['logratio'] < 0].nsmallest(N, 'logratio', keep='first') 
    group_more_word2 = plot_df[plot_df['logratio'] >= 0].nlargest(N, 'logratio', keep='first')

    plot_data_df = pd.concat([group_more_word1, group_more_word2]).sort_values(by='logratio')

    if plot_data_df.empty:
        print("No data to plot after filtering top N words.")
    else:
        print(f"Plotting top {len(plot_data_df)} words by log ratio.")
        plt.figure(figsize=(10, max(6, len(plot_data_df) * 0.4)))

        colors = ['#F8766D' if x < 0 else '#00BFC4' for x in plot_data_df['logratio']]

        plt.hlines(y=plot_data_df['w2'], xmin=0, xmax=plot_data_df['logratio'], color=colors, alpha=0.8, linewidth=2)

        plt.scatter(plot_data_df['logratio'], plot_data_df['w2'], color=colors, s=50, zorder=3)

        plt.yticks(plot_data_df['w2'])
        plt.ylabel(None)

        plt.xlabel(f"Log Ratio: More '{word1_processed}' <-> More '{word2_processed}'")
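        # Widen the tick range when a log ratio falls outside [-3, 3] so no point is clipped by xlim
        max_tick = max(3, int(np.ceil(plot_data_df['abslogratio'].max())))
        x_ticks_values = np.arange(-max_tick, max_tick + 1)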
        x_ticks_labels = [f"{2**abs(x):.0f}x" if x != 0 else "Same" for x in x_ticks_values]
        plt.xticks(x_ticks_values, x_ticks_labels)
        plt.xlim(min(x_ticks_values)-0.5, max(x_ticks_values)+0.5) 

        plt.title(f"Words More Associated with '{word1_processed.capitalize()}' vs '{word2_processed.capitalize()}'", pad=20)

        from matplotlib.lines import Line2D
        legend_elements = [
            Line2D([0], [0], marker='o', color='w', label=f'More "{word2_processed.capitalize()}"', markerfacecolor='#00BFC4', markersize=10),
            Line2D([0], [0], marker='o', color='w', label=f'More "{word1_processed.capitalize()}"', markerfacecolor='#F8766D', markersize=10)
        ]
        plt.legend(handles=legend_elements, loc='lower center', ncol=2, bbox_to_anchor=(0.5, -0.15 - (0.03 * (len(plot_data_df)/10) ) ) )

        plt.grid(True, which='major', axis='x', linestyle='--', alpha=0.7)
        plt.grid(False, which='major', axis='y') 
        sns.despine(left=True, bottom=True) 
        plt.tight_layout(rect=[0, 0.05, 1, 0.95]) 
        
        temp_plot_path = "/tmp/plot_for_download.png"
        try:
            plt.savefig(temp_plot_path, bbox_inches='tight')
            print(f"Plot temporarily saved to {temp_plot_path}")
        except Exception as e:
            print(f"Error saving plot to temporary file: {e}")
            temp_plot_path = None 

        plt.show()
else:
    print("Skipping plot generation as 'word_ratios_df' is empty or not defined.")

print("Step 6 (including plotting) execution finished.")

In [None]:
#@title Step 7: Save Plot and Data, and Provide Download Links.
print("Starting Step 7: Saving results and providing download links...")

if 'output_directory' not in locals() or not output_directory:
    print("Error: Google Drive output directory ('output_directory') is not defined. Please run Step 3 (Mount Google Drive).")
else:
    if 'temp_plot_path' in locals() and temp_plot_path and os.path.exists(temp_plot_path):
        try:
            output_plot_drive_path = os.path.join(output_directory, output_plot_filename)
            shutil.copy(temp_plot_path, output_plot_drive_path)
            print(f"Plot successfully saved to Google Drive: {output_plot_drive_path}")
            print(f"Offering plot image '{output_plot_filename}' for download...")
            files.download(output_plot_drive_path)
        except Exception as e:
            print(f"Error copying plot to Google Drive or providing download: {e}")
    elif 'temp_plot_path' in locals() and temp_plot_path and not os.path.exists(temp_plot_path):
        print(f"Error: Temporary plot file {temp_plot_path} not found. Plot might not have been saved correctly in Step 6.")
    else:
        print("Skipping plot saving/downloading as the plot was not generated or saved in Step 6.")

    if 'word_ratios_df' in locals() and not word_ratios_df.empty:
        try:
            output_csv_drive_path = os.path.join(output_directory, output_csv_filename)
            df_to_save = word_ratios_df.reset_index()
            df_to_save.to_csv(output_csv_drive_path, index=False)
            print(f"Data CSV successfully saved to Google Drive: {output_csv_drive_path}")
            print(f"Offering data CSV '{output_csv_filename}' for download...")
            files.download(output_csv_drive_path)
        except Exception as e:
            print(f"Error saving data CSV to Google Drive or providing download: {e}")
    else:
        print("Skipping data CSV saving/downloading as 'word_ratios_df' is empty or not defined from Step 6.")

print("Step 7 execution finished.")

# ✅ Analysis Complete!

Thank you for using the **Plot Collocations Pair Tool**!

## Summary of Results:
The analysis has finished. If all steps were successful:
*   Your collocation plot (e.g., `collocation_plot.png`) and the associated data CSV (e.g., `bigram_ratios.csv`) have been saved to your Google Drive in the directory specified in Step 3 (typically `MyDrive/Colab_Data_Collocations_Pair`).
*   Browser downloads for these files were also triggered directly in Step 7.
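
When reading the saved CSV, the `logratio` column is on a log2 scale, so `2 ** abs(logratio)` gives the fold difference shown on the plot's x-axis labels. The sketch below uses made-up rows rather than real output, and the `fold` and `leans_toward` columns are hypothetical additions for illustration.

```python
import pandas as pd

# Hypothetical rows; a real file would also hold the count and normalized columns.
ratios = pd.DataFrame({
    "w2": ["say", "run"],
    "logratio": [1.0, -2.0],
})

# Convert log2 ratios back to fold differences (1.0 -> 2x, -2.0 -> 4x).
ratios["fold"] = 2 ** ratios["logratio"].abs()

# Non-negative values lean toward word 2 ('she'), negative values toward word 1 ('he').
ratios["leans_toward"] = ratios["logratio"].apply(lambda x: "she" if x >= 0 else "he")
print(ratios)
```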

## Troubleshooting Common Issues:

If you encountered any problems, here are some common troubleshooting tips:

1.  **File Not Found Errors**:
    *   **CSV Upload (Step 4)**: Ensure you uploaded your CSV file successfully. If you re-run the notebook, you might need to re-upload.
    *   **Google Drive Path (Step 3 & 7)**: Verify that Google Drive was mounted correctly and the `output_directory` path is valid. Check for typos.
    *   **Temporary Plot File (Step 7)**: If you see errors about `/tmp/plot_for_download.png` not found, it means Step 6 failed to save the plot image before Step 7 tried to access it. Review errors in Step 6.

2.  **Incorrect Text Column (Step 5)**:
    *   Double-check that the `text_column_name` you specified in Step 5 exactly matches the column header in your CSV file that contains the lemmatized text.

3.  **CSV Encoding Issues**:
    *   If you see errors when pandas tries to read the CSV (`pd.read_csv()` in Step 6), your file might not be UTF-8 encoded. Try specifying the encoding, e.g., `pd.read_csv(input_csv_name, encoding='cp1252')` or `encoding='latin1'` (`latin1` and `iso-8859-1` are the same encoding). UTF-8 is the default and usually preferred.

4.  **Google Drive Permissions/Quota (Step 3 & 7)**:
    *   Ensure Colab has the necessary permissions to access your Google Drive.
    *   Check if your Google Drive has sufficient storage space.
    *   Sometimes, re-mounting Google Drive (by re-running Step 3, potentially with `force_remount=True`) can resolve transient issues.

5.  **Handling of Large Files / Colab Limits**:
    *   Processing very large text corpora can be time-consuming and memory-intensive. Colab has usage limits. If the notebook crashes or is very slow, consider:
        *   Using a smaller subset of your data for testing.
        *   Optimizing data loading if possible, e.g., reading only the text column with `pd.read_csv(input_csv_name, usecols=[text_column_name])`.
        *   Using a more powerful environment if your task consistently exceeds Colab's capabilities.

6.  **Plot Not Looking as Expected (Step 6 Output)**:
    *   **Input Words (Step 5)**: Ensure the `word1_input` and `word2_input` are correctly specified and are present in your data.
    *   **Data Quality**: The quality of the plot depends heavily on the input text data (e.g., quality of lemmatization, size of corpus).
    *   **Filtering Thresholds (Step 6)**: The script keeps only collocates that appear more than 10 times in total directly after the two target words combined. If your dataset is small or your target words are infrequent, you might get few or no results. Consider adjusting this threshold in the code of Step 6 if necessary (look for `w2_group_totals > 10`).
    *   **Zero Counts / Errors**: If your chosen words (e.g., 'he', 'she') or their collocates are very infrequent or absent in the provided text column, you might see empty plots, zero log ratios, or errors. Check the console output in Step 6 for messages about counts.

7.  **NLTK Resource Download (Step 2)**:
    *   The notebook attempts to download the 'punkt' and 'punkt_tab' tokenizer resources for NLTK if they're not found. If this fails, it's usually due to network issues. Try running Step 2 again.
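
For the encoding issue in item 3, one option is to try a few encodings in turn rather than editing Step 6 by hand each time. The helper below is a hypothetical sketch, not part of the notebook; the name `read_csv_with_fallback` and the order of encodings are assumptions. Note that `latin1` accepts any byte sequence, so it is placed last and may silently mis-decode some characters.

```python
import os
import tempfile

import pandas as pd

def read_csv_with_fallback(path, encodings=("utf-8", "cp1252", "latin1")):
    """Hypothetical helper: return (DataFrame, encoding) for the first encoding that decodes."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise last_error

# Demo: a tiny file saved as Windows-1252, which is not valid UTF-8.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write("text_lemmatized\ncafé ouvert\n".encode("cp1252"))

demo_df, used_encoding = read_csv_with_fallback(tmp.name)
os.remove(tmp.name)
print(used_encoding, demo_df["text_lemmatized"][0])
```

In the notebook, the equivalent change would be to replace the `pd.read_csv(input_csv_name)` call in Step 6 with a call to a helper like this one.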

If issues persist, carefully review the error messages in each cell's output. These messages often provide specific clues about what went wrong.

Happy analyzing!