<a href="https://colab.research.google.com/github/jessicasmelton/YTCommentAnalysis/blob/main/Data%20Cleaning%20Program.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Data Cleaning Program:**

This program is designed to clean and preprocess YouTube comment data extracted from an Excel file. The program performs several key operations: it converts text to lowercase, removes punctuation and special characters, and eliminates common English stop words. The cleaned comments are then saved to a new CSV file for further analysis.
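
To make the three steps concrete, here is a minimal sketch that traces one sample comment through them. The sample comment, the `trace_cleaning` helper, and the tiny `DEMO_STOP_WORDS` set are illustrative assumptions and not part of the program, which uses a much longer stop word list.

```python
import string

# Illustrative assumption: a tiny stop word set, used only for this trace.
DEMO_STOP_WORDS = {"this", "is", "the", "i", "a"}

def trace_cleaning(comment):
    """Return the intermediate result after each cleaning step."""
    # Step 1: lowercase
    lowered = comment.lower()
    # Step 2: strip ASCII punctuation
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Step 3: drop stop words and rejoin the remaining tokens
    kept = [word for word in no_punct.split() if word not in DEMO_STOP_WORDS]
    return lowered, no_punct, " ".join(kept)

steps = trace_cleaning("This is the BEST video, I love it!!!")
for step in steps:
    print(step)
```

Note that `string.punctuation` covers ASCII punctuation only, so emoji and non-ASCII symbols pass through this step unchanged.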


---


**Usage**

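* Ensure your Excel file is correctly formatted and saved in the specified location. The program skips the first row of each sheet (skiprows=1) because an earlier version failed to read the column names, so the header row, including Comment Text, is expected on the second row.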

* Replace the input_file variable value in the code with the path to your Excel file. Make sure the file path is correctly specified to avoid file not found errors.

* Execute the program in a Python environment such as Google Colab, Jupyter Notebook, or any local Python environment.

* The program reads the Excel file, processes the comments, and saves the cleaned data to a new CSV file.
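
Before running the program, the input path can be checked up front. The sketch below is one possible way to do that with the standard library; `check_input_path` and its `expected_suffix` parameter are hypothetical names, not part of the program.

```python
from pathlib import Path

def check_input_path(path_str, expected_suffix=".xlsx"):
    """Return a list of problems found with the input path (empty if none)."""
    path = Path(path_str)
    problems = []
    if not path.is_file():
        problems.append(f"File not found: {path}")
    if path.suffix.lower() != expected_suffix:
        problems.append(f"Expected a {expected_suffix} file, got '{path.suffix}'")
    return problems

# The placeholder path from the program fails the existence check
# until it is replaced with a real file path.
for problem in check_input_path("INSERT FILE PATH HERE.xlsx"):
    print(problem)
```

An empty list means the path points to an existing file with the expected extension; it does not confirm that the sheets are laid out as the program expects.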


---


**Notes**
* The program checks for the existence of a column named Comment Text in the combined DataFrame. If the column is missing, the program raises a KeyError.

* The program uses a basic set of English stop words. You can customize the stop_words set in the preprocess_text function to include additional stop words as needed.
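
As a rough illustration of customizing the stop words, the sketch below merges extra words into a base set before filtering. `BASE_STOP_WORDS` is an abbreviated stand-in for the full list, and `remove_stop_words` with its `extra_stop_words` parameter is a hypothetical helper; in the program itself you would add entries directly to the `stop_words` list inside `preprocess_text`.

```python
# Abbreviated stand-in for the program's full stop word list (assumption).
BASE_STOP_WORDS = {"a", "an", "the", "and", "is", "this"}

def remove_stop_words(text, extra_stop_words=()):
    """Drop base stop words plus any caller-supplied extras from text."""
    stop_words = BASE_STOP_WORDS | set(extra_stop_words)
    return " ".join(word for word in text.split() if word not in stop_words)

default_result = remove_stop_words("this video is great lol")
custom_result = remove_stop_words(
    "this video is great lol", extra_stop_words={"video", "lol"}
)
print(default_result)
print(custom_result)
```

Domain-specific filler such as "video" or "lol" is common in YouTube comments, which is why it may be worth adding to the set for this kind of data.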


---


**Potential Errors and Fixes**

* Ensure the file path to the Excel file is correct. Verify that the file exists at the specified location.

* If the 'Comment Text' column is missing, ensure that all sheets in the Excel file contain this column. The program relies on this column for preprocessing.

* If there are issues with saving the CSV file, check for special characters in the file path or name that may cause problems. Ensure the directory where the file is being saved exists and is writable.
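
Because the program combines all sheets before checking for 'Comment Text', the check passes as long as at least one sheet has the column; rows from sheets without it end up with empty cleaned text. The sketch below shows one way to find which sheets lack the column. `sheets_missing_column` is a hypothetical helper that takes a mapping of sheet name to column names; with the program's `df` dictionary, such a mapping could be built as `{name: sheet.columns for name, sheet in df.items()}`.

```python
def sheets_missing_column(sheet_columns, required="Comment Text"):
    """Return the names of sheets whose columns do not include `required`.

    sheet_columns: mapping of sheet name -> iterable of column names.
    Column names are stripped of surrounding spaces before comparison,
    mirroring the strip step in the program.
    """
    missing = []
    for sheet_name, columns in sheet_columns.items():
        normalized = {str(column).strip() for column in columns}
        if required not in normalized:
            missing.append(sheet_name)
    return missing

# Made-up sheet names and columns, for illustration only.
example_sheets = {
    "Video 1": ["Author", "Comment Text ", "Likes"],
    "Video 2": ["Author", "Comments", "Likes"],
}
missing_sheets = sheets_missing_column(example_sheets)
print(missing_sheets)
```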

In [None]:
# Data Cleaning Program (for Excel Files)

# This program skips the first row because there was an error
# reading the columns in the first iteration of this program.

# Import necessary libraries
import pandas as pd  # Library for data manipulation and analysis
import string  # Library for string operations

# Function to preprocess the text in the 'Comment Text' column
def preprocess_text(text):
    # Check if the input is a string
    if not isinstance(text, str):
        return ""  # Return empty string if the input is not a string

    # Convert text to lowercase
    text = text.lower()

    # Remove punctuation and special characters using string translation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Tokenize text by splitting it into words (tokens) based on spaces
    tokens = text.split()

    # Define a basic set of stop words to be removed from the text
    stop_words = set([
        'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
        'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself',
        'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them',
        'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this',
        'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been',
        'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing',
        'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
        'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between',
        'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to',
        'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again',
        'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how',
        'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some',
        'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too',
        'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now'
    ])
    # Remove stop words from the tokenized text
    tokens = [word for word in tokens if word not in stop_words]

    # Join the cleaned tokens back into a single string
    cleaned_text = ' '.join(tokens)
    return cleaned_text

# Read the Excel file, skipping the first row due to an error in reading columns initially
input_file = 'INSERT FILE PATH HERE.xlsx'
df = pd.read_excel(input_file, sheet_name=None, skiprows=1)  # Skip the first row

# Combine the data from all sheets into one DataFrame
# (sheet_name=None returns a dict that maps each sheet name to its DataFrame)
combined_df = pd.concat(list(df.values()), ignore_index=True)

# Strip any leading/trailing spaces from column names to ensure consistency
combined_df.columns = combined_df.columns.str.strip()

# Verify that the 'Comment Text' column exists in the combined DataFrame
if 'Comment Text' not in combined_df.columns:
    raise KeyError("The combined data does not contain a column named 'Comment Text'. Please check the column names.")

# Apply the preprocessing function to the 'Comment Text' column and create a new column 'Cleaned Comment Text'
combined_df['Cleaned Comment Text'] = combined_df['Comment Text'].apply(preprocess_text)

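# Save the cleaned DataFrame to a new CSV file
# Note: the leading '/' writes to the filesystem root. Adjust this path as needed, and make sure
# file_path in the translation program below points to the same file.
output_file = '/CLEANED_Youtube_Comment_Data.csv'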
combined_df.to_csv(output_file, index=False)

# Print a completion message indicating the location of the saved file
print(f"Preprocessed data has been saved to {output_file}")

**Data Cleaning Program That Translates Non-English Comments to English:**

This program translates YouTube comments from various languages to English and then cleans the translated comments. It handles text processing tasks such as converting text to lowercase, removing punctuation, eliminating stop words, and applying lemmatization. The processed comments are saved to a new CSV file.


---

**Usage**

* Ensure your CSV file (from the previous cleaning step) is correctly formatted and saved in the specified location. The file should contain the cleaned comments from the first data cleaning program.

* Replace the file_path variable value in the code with the path to your CSV file. Make sure the file path is correctly specified to avoid file not found errors.

* Execute the program in a Python environment such as Google Colab, Jupyter Notebook, or any local Python environment.

* The program reads the CSV file, processes the comments, and saves the cleaned and translated data to a new CSV file.

**Notes**

* The program assumes the CSV file contains a column named Comment Text. If the column is missing, the program fails with a KeyError.

* The program detects the language of each comment and translates non-English comments to English. Very short comments or those that cannot be detected are marked as 'unknown'. Because every comment not detected as English is selected, 'unknown' comments are also sent for translation. Comments detected as English are left unchanged.

* The program uses NLTK's English stop word list. You can customize the stop_words set in the remove_stopwords function to include additional stop words as needed.
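
The detection rule can be sketched without any external service. In the sketch below, `detect_language_safe`, its `detector` and `min_chars` parameters, and `toy_detector` are hypothetical; the toy detector simply treats ASCII-only text as English and stands in for `langdetect.detect`, which the program uses.

```python
def detect_language_safe(text, detector, min_chars=3):
    """Return a language code, or 'unknown' for missing, short, or undetectable text."""
    if not isinstance(text, str) or len(text.strip()) < min_chars:
        return "unknown"
    try:
        return detector(text)
    except Exception:
        # The program catches LangDetectException specifically;
        # this sketch is broader because the detector is pluggable.
        return "unknown"

def toy_detector(text):
    """Stand-in detector (assumption): ASCII-only text counts as English."""
    return "en" if text.isascii() else "other"

comments = ["ok", "great video thanks", "très bonne vidéo", None]
labels = [detect_language_safe(comment, toy_detector) for comment in comments]

# Mirrors the program's filter: everything not labelled 'en' is translated,
# which includes comments labelled 'unknown'.
to_translate = [c for c, label in zip(comments, labels) if label != "en"]
print(labels)
print(to_translate)
```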



---



**Potential Errors and Fixes**

* Ensure the file path to the CSV file is correct. Verify that the file exists at the specified location.

* If the 'Comment Text' column is missing, ensure that the CSV file contains this column. The program relies on this column for preprocessing and translation.

* If translation errors occur, they are logged, and the original text is retained. Check the console output for specific error messages.

* If there are issues with saving the CSV file, check for special characters in the file path or name that may cause problems. Ensure the directory where the file is being saved exists and is writable.
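
The fallback behavior for translation errors can be illustrated in isolation. This is a sketch under stated assumptions: `translate_with_fallback`, its `log` parameter, and `flaky_translate` are hypothetical names, and the fake translator stands in for the real translation call.

```python
def translate_with_fallback(text, translate, log=print):
    """Apply `translate` to text; on any error, log it and return the original text."""
    try:
        return translate(text)
    except Exception as error:
        log(f"Error processing text: {text!r}\nError: {error}")
        return text

def flaky_translate(text):
    """Stand-in translator (assumption): fails on inputs containing 'boom'."""
    if "boom" in text:
        raise RuntimeError("service unavailable")
    return text.upper()

logged_errors = []
results = [
    translate_with_fallback(text, flaky_translate, log=logged_errors.append)
    for text in ["hola", "boom"]
]
print(results)
print(len(logged_errors))
```

Rows that fall back to the original text have not been translated or cleaned, so collecting the errors in a list, as done here, makes it easier to find and reprocess them afterwards.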

In [None]:
# Install First

!pip install googletrans==4.0.0-rc1
!pip install nltk
!pip install langdetect

In [None]:
# Import necessary libraries
import pandas as pd  # Library for data manipulation and analysis
from googletrans import Translator  # Library for translating text
import nltk  # Natural Language Toolkit for text processing
from nltk.corpus import stopwords  # Module for stop words
from nltk.tokenize import word_tokenize  # Module for tokenizing text
from nltk.stem import WordNetLemmatizer  # Module for lemmatizing words
from langdetect import detect, LangDetectException  # Library for detecting language
import re  # Regular expressions library for text cleaning

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('punkt_tab')  # Tokenizer data that newer NLTK releases look for instead of 'punkt'
nltk.download('stopwords')
nltk.download('wordnet')

# Load the cleaned CSV file containing YouTube comments
file_path = '/content/Cleaned_Youtube_Comments.csv'
df = pd.read_csv(file_path)

# Initialize the Google Translator and NLTK Lemmatizer
translator = Translator()
lemmatizer = WordNetLemmatizer()

# Function to clean text by converting to lowercase, removing punctuation, special characters, and numbers
def clean_text(text):
    text = text.lower()  # Convert text to lowercase
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation and special characters
    text = re.sub(r'\d+', '', text)  # Remove numbers
    return text

# Function to remove stop words from text
def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))  # Set of English stop words
    word_tokens = word_tokenize(text)  # Tokenize text into words
    filtered_text = [word for word in word_tokens if word not in stop_words]  # Remove stop words
    return filtered_text

# Function to apply lemmatization to a list of words
def apply_lemmatization(words):
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]  # Lemmatize words
    return ' '.join(lemmatized_words)  # Join lemmatized words back into a single string

# Function to translate text to English, clean it, remove stop words, and apply lemmatization
def translate_and_clean(text):
    try:
        translated = translator.translate(text, dest='en').text  # Translate text to English
        cleaned_text = clean_text(translated)  # Clean the translated text
        tokens = remove_stopwords(cleaned_text)  # Remove stop words from cleaned text
        lemmatized_text = apply_lemmatization(tokens)  # Apply lemmatization to the tokens
        return lemmatized_text  # Return the processed text
    except Exception as e:
        print(f"Error processing text: {text}\nError: {e}")  # Print error message if processing fails
        return text  # Return the original text if processing fails

# Function to detect the language of text with checks for missing, empty, or very short comments
def detect_language(text):
    # Treat missing values (NaN), other non-strings, and very short texts as 'unknown'
    if not isinstance(text, str) or len(text.strip()) < 3:
        return 'unknown'
    try:
        return detect(text)  # Detect the language of the text
    except LangDetectException:
        return 'unknown'  # Return 'unknown' if language detection fails

# Apply language detection to each comment and create a new column 'Language'
df['Language'] = df['Comment Text'].apply(detect_language)

# Select the comments that were not detected as English
non_english_mask = df['Language'] != 'en'

# Translate and clean those comments in place
# (assigning through .loc avoids pandas' SettingWithCopyWarning)
df.loc[non_english_mask, 'Comment Text'] = df.loc[non_english_mask, 'Comment Text'].apply(translate_and_clean)

# Drop the 'Language' column from the DataFrame as it is no longer needed
df.drop(columns=['Language'], inplace=True)

# Save the cleaned and translated comments to a new CSV file
output_file_path = '/content/CLEANED_Translated_Youtube_Comments.csv'
df.to_csv(output_file_path, index=False)

# Print completion message indicating the location of the saved file
print(f"Cleaned and translated comments have been saved to {output_file_path}")