# Sentiment Model Training for Mental Health Chatbot
    
This Jupyter Notebook guides you through the process of training a sentiment analysis model using a Kaggle dataset. The trained model will then be used by the chatbot to understand user emotions.
   

## 1. Setup and Library Installation
First, make sure the required libraries are installed. If an import fails, run the matching `pip install` command in your terminal, or in a new notebook cell prefixed with `!` (e.g., `!pip install pandas`).
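Before running the imports, it can help to check which packages are importable in the current environment. The following is a minimal sketch using only the standard library; the helper name `find_missing_packages` and the `REQUIRED_MODULES` list are illustrative assumptions, not part of the notebook. Note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util


def find_missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names used by this notebook (scikit-learn is imported as `sklearn`).
REQUIRED_MODULES = ["pandas", "nltk", "sklearn", "joblib"]

missing = find_missing_packages(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Any name printed as missing is a candidate for `pip install`; for `sklearn`, the package to install is `scikit-learn`.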

In [1]:
import pandas as pd
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
import joblib # For saving and loading the model and vectorizer

## 2. Download NLTK Data

NLTK (Natural Language Toolkit) requires some data files for tokenization and stopwords. Run this cell to download them if they are missing.

In [4]:
import nltk
try:
    nltk.data.find('corpora/stopwords')
    nltk.data.find('tokenizers/punkt')
except LookupError:  # nltk.data.find raises LookupError when a resource is missing
    nltk.download('stopwords')
    nltk.download('punkt')

print("NLTK data checked/downloaded.")

NLTK data checked/downloaded.


## 3. Load the Dataset
Make sure you have downloaded `Sentiment_Analysis_for_Mental_Health.csv` from Kaggle and placed it in the `data/` folder inside your `chatbot_project` directory.

In [22]:
import pandas as pd

DATASET_PATH = 'data/Sentiment_Analysis_for_Mental_Health.csv'

try:
    df = pd.read_csv(DATASET_PATH)
    df.columns = df.columns.str.strip() # Good practice to strip whitespace

    print("Dataset loaded successfully. Head of the dataframe:")
    print(df.head())

    print("\n--- IMPORTANT: Columns in the dataset (check these carefully!) ---")
    print(df.columns.tolist()) # <<< LOOK AT THIS OUTPUT!
    print("-------------------------------------------------------------------")


except FileNotFoundError:
    print(f"Error: Dataset not found at '{DATASET_PATH}'. Please ensure it's in the 'data/' folder.")
    df = None
except Exception as e:
    print(f"An unexpected error occurred while loading the dataset: {e}")
    df = None

Dataset loaded successfully. Head of the dataframe:
   Unnamed: 0                                          statement   status
0           0                                         oh my gosh  Anxiety
1           1  trouble sleeping, confused mind, restless hear...  Anxiety
2           2  All wrong, back off dear, forward doubt. Stay ...  Anxiety
3           3  I've shifted my focus to something else but I'...  Anxiety
4           4  I'm restless and restless, it's been a month n...  Anxiety

--- IMPORTANT: Columns in the dataset (check these carefully!) ---
['Unnamed: 0', 'statement', 'status']
-------------------------------------------------------------------


## 4. Data Preprocessing
We'll clean the text and map the mental health status labels to sentiment categories. With the mapping used in this cell, every condition label becomes `negative` and `Normal` becomes `neutral`; no label maps to `positive`.
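Because the label mapping is an exact, case-sensitive lookup, any label missing from it is dropped by the filtering step. The sketch below audits labels before filtering, using plain Python lists rather than a DataFrame; the helper `audit_labels` and the sample values are hypothetical.

```python
from collections import Counter


def audit_labels(labels, mapping):
    """Count how many rows each unmapped label would cost in the filtering step."""
    return Counter(label for label in labels if label not in mapping)


# Hypothetical sample; the real values come from the dataset's 'status' column.
sample_mapping = {"Anxiety": "negative", "Normal": "neutral"}
sample_labels = ["Anxiety", "Normal", "anxiety", "Normal", "Stress"]

unmapped = audit_labels(sample_labels, sample_mapping)
print(dict(unmapped))
```

With a DataFrame, `df['label_original'].unique()` gives the values to compare against the keys of `LABEL_MAPPING`.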
   

In [None]:
import pandas as pd
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# --- Configuration ---
DATASET_PATH = '../data/Sentiment_Analysis_for_Mental_Health.csv'

# --- NLTK Data Download (ensure these are downloaded if running standalone) ---
# If you run this as a standalone script, you might need to uncomment and run these
# lines once to ensure NLTK data is available.
# try:
#     nltk.data.find('corpora/stopwords')
#     nltk.data.find('tokenizers/punkt')
#     nltk.data.find('sentiment/vader_lexicon')
# except LookupError:
#     print("Downloading NLTK data... This might take a moment.")
#     nltk.download('stopwords')
#     nltk.download('punkt')
#     nltk.download('vader_lexicon')
# print("NLTK data checked/downloaded.")


# --- Section 3: Load the Dataset ---
print("--- Starting Data Loading ---")
df = None # Initialize df to None for error handling

try:
    df = pd.read_csv(DATASET_PATH)
    df.columns = df.columns.str.strip() # Strip whitespace from column names

    print("Dataset loaded successfully. Head of the dataframe:")
    print(df.head())

    print("\n--- IMPORTANT: Columns in the dataset (verify these!) ---")
    print(df.columns.tolist())
    print("----------------------------------------------------------")

    # This diagnostic block will check if 'status' column is present
    if 'status' in df.columns:
        print("\nValue counts for 'status' column:")
        print(df['status'].value_counts())
    else:
        print("\n'status' column not found. Please ensure your CSV has a column named 'status' (lowercase) or update the code accordingly.")
        df = None # Set df to None to prevent further errors if crucial column is missing

except FileNotFoundError:
    print(f"Error: Dataset not found at '{DATASET_PATH}'. Please ensure it's in the 'data/' folder.")
    print("Please download the dataset from: https://www.kaggle.com/datasets/suchintikasarkar/sentiment-analysis-for-mental-health")
    df = None
except Exception as e:
    print(f"An unexpected error occurred while loading the dataset: {e}")
    df = None


# --- Section 4: Data Preprocessing ---
print("\n--- Starting Data Preprocessing ---")

if df is not None and not df.empty:
    try:
        # SELECTING THE CORRECT COLUMNS:
        # Based on your output: ['Unnamed: 0', 'statement', 'status']
        # We select 'statement' for text and 'status' for the original label.
        df = df[['statement', 'status']].copy()
        df.columns = ['text', 'label_original'] # Standardizing internal column names

        # --- TEMPORARY DEBUGGING LINE (UNCOMMENT TO CHECK UNIQUE LABELS) ---
        # print("\nUnique values found in 'label_original' (your 'status' column):")
        # print(df['label_original'].unique())
        # print("------------------------------------------------------------")
        # --- END OF TEMPORARY DEBUGGING LINE ---

    except KeyError as e:
        print(f"Error: Missing expected column in dataset for preprocessing: {e}.")
        print("Please ensure your CSV has 'statement' and 'status' columns (case-sensitive) or update the selection: df[['your_text_column', 'your_status_column']].copy()")
        df = None

if df is not None and not df.empty:
    # LABEL MAPPING:
    # This dictionary maps the original unique values from your 'status' column
    # (now named 'label_original') to standardized sentiment categories.
    # MAKE SURE THE KEYS HERE EXACTLY MATCH THE UNIQUE VALUES FROM YOUR 'status' COLUMN,
    # INCLUDING CASE. If your 'status' column has 'normal' (lowercase), use 'normal' as key.
    LABEL_MAPPING = {
        'Depression': 'negative',
        'Suicidal': 'negative',
        'Anxiety': 'negative',
        'Stress': 'negative',
        'Bi-Polar': 'negative',
        'Personality Disorder': 'negative',
        'Normal': 'neutral'
        # If your 'status' column values are, for example, all lowercase:
        # 'depression': 'negative',
        # 'suicidal': 'negative',
        # 'anxiety': 'negative',
        # 'stress': 'negative',
        # 'bi-polar': 'negative',
        # 'personality disorder': 'negative',
        # 'normal': 'neutral'
    }

    # Filter out rows where 'label_original' is not in our LABEL_MAPPING
    initial_rows = len(df)
    df = df[df['label_original'].isin(LABEL_MAPPING.keys())].copy()
    filtered_rows = len(df)
    if initial_rows != filtered_rows:
        print(f"Warning: Filtered out {initial_rows - filtered_rows} rows due to unmapped 'label_original' values.")
        print(f"Remaining rows: {filtered_rows} rows.")
    if df.empty:
        print("\nFATAL WARNING: DataFrame is EMPTY after filtering/mapping. This means no valid labels were found after filtering.")
        print("Please check your LABEL_MAPPING keys against the actual unique values in your 'status' column (`df['label_original'].unique()`).")
        df = None # Ensure df is None if it's empty to prevent downstream errors.

if df is not None and not df.empty:
    # Create the 'sentiment' column by mapping the 'label_original' values
    df['sentiment'] = df['label_original'].map(LABEL_MAPPING)

    # Build the stopword set once instead of on every call
    STOP_WORDS = set(stopwords.words('english'))

    # Text cleaning function
    def preprocess_text(text):
        """Lowercase, strip punctuation, tokenize, and drop English stopwords."""
        if not isinstance(text, str):
            return ""
        text = text.lower()
        text = re.sub(f'[{re.escape(string.punctuation)}]', '', text)
        tokens = word_tokenize(text)
        filtered_tokens = [word for word in tokens if word not in STOP_WORDS]
        return " ".join(filtered_tokens)

    df['cleaned_text'] = df['text'].apply(preprocess_text)

    print("\nData after preprocessing and sentiment mapping:")
    print(df.head())
    print("\nSentiment distribution (after mapping):")
    print(df['sentiment'].value_counts())
else:
    print("Skipping preprocessing: DataFrame is empty or not loaded correctly due to previous errors.")

Remaining rows: 48965 rows.

Data after preprocessing and sentiment mapping:
                                                text label_original  \
0                                         oh my gosh        Anxiety   
1  trouble sleeping, confused mind, restless hear...        Anxiety   
2  All wrong, back off dear, forward doubt. Stay ...        Anxiety   
3  I've shifted my focus to something else but I'...        Anxiety   
4  I'm restless and restless, it's been a month n...        Anxiety   

                                        cleaned_text  
0                                            oh gosh  
1  trouble sleeping confused mind restless heart ...  
2  wrong back dear forward doubt stay restless re...  
3  ive shifted focus something else im still worried  
4                im restless restless month boy mean  

Sentiment distribution (after mapping):


KeyError: 'sentiment'

## 5. Feature Extraction (TF-IDF)
    
We'll convert the text data into numerical features that a machine learning model can understand using TF-IDF (Term Frequency-Inverse Document Frequency).
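To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on a toy corpus. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is my understanding of scikit-learn's default settings. It splits on whitespace only, so it will not reproduce `TfidfVectorizer` output exactly. The function name `tfidf_vectors` and the sample documents are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Compute L2-normalized TF-IDF weights for whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency (assumed formula, see lead-in).
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()} if norm else {})
    return vectors


docs = ["restless worried restless", "happy calm", "worried tired"]
for doc, vector in zip(docs, tfidf_vectors(docs)):
    print(doc, "->", {term: round(w, 3) for term, w in vector.items()})
```

In the first document, `restless` outweighs `worried` because it occurs twice there and appears in no other document, while `worried` is shared with the third document.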


In [None]:
if df is not None:
    X = df['cleaned_text']
    y = df['sentiment']

    # Split data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

    # Initialize and fit TF-IDF Vectorizer
    tfidf_vectorizer = TfidfVectorizer(max_features=5000) # Cap the vocabulary at 5,000 terms to limit memory use
    X_train_vec = tfidf_vectorizer.fit_transform(X_train)
    X_test_vec = tfidf_vectorizer.transform(X_test)

    print("TF-IDF Vectorizer fitted. Shape of training data:", X_train_vec.shape)

## 6. Train the Sentiment Model (Logistic Regression)
    
We'll use a Logistic Regression model for sentiment classification. It's a good baseline and often performs well for text classification tasks.
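As a rough illustration of what the fitted model does at prediction time in the two-class case, the sketch below scores a feature vector as a weighted sum and passes it through the sigmoid function. The weights, bias, and feature values are invented for the example and are not taken from the trained model.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the target class for a sparse {term: value} feature dict."""
    score = bias + sum(weights.get(term, 0.0) * value for term, value in features.items())
    return sigmoid(score)


# Hypothetical weights: larger values push the prediction toward 'negative'.
example_weights = {"worried": 1.8, "restless": 1.2, "calm": -1.5}
example_bias = -0.2

p_negative = predict_proba({"worried": 0.6, "restless": 0.8}, example_weights, example_bias)
print(f"P(negative) = {p_negative:.3f}")
```

Training amounts to choosing the weights and bias so that these probabilities match the labeled examples as closely as possible.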
   

In [None]:
if df is not None:
    # Initialize and train Logistic Regression model
    sentiment_model = LogisticRegression(max_iter=1000, solver='liblinear') # 'liblinear' often good for small datasets
    sentiment_model.fit(X_train_vec, y_train)
    print("Model training complete.")


## 7. Evaluate the Model
    
It's important to evaluate how well our model performs on unseen data.
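The classification report lists precision, recall, and F1 per class alongside overall accuracy. The sketch below computes the same quantities by hand on a handful of made-up labels, to show what each number measures; the helper names and the label lists are hypothetical and are not results from this notebook.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision, recall, and F1 computed from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # correctly predicted as label
    fp = sum(1 for t, p in pairs if t != label and p == label)  # wrongly predicted as label
    fn = sum(1 for t, p in pairs if t == label and p != label)  # label that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical labels for illustration only.
true_labels = ["negative", "negative", "neutral", "neutral", "negative"]
pred_labels = ["negative", "neutral", "neutral", "negative", "negative"]

print("accuracy:", accuracy(true_labels, pred_labels))
print("negative (precision, recall, f1):", precision_recall_f1(true_labels, pred_labels, "negative"))
```

On imbalanced data, per-class recall is often more informative than accuracy, since a model can score well overall while missing most of a small class.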
   

In [None]:
if df is not None:
    y_pred = sentiment_model.predict(X_test_vec)
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))

    print("\nAccuracy Score:", accuracy_score(y_test, y_pred))

    # Test with some custom sentences
    test_sentences = [
        "I feel so sad and alone today, nothing is going right.",
        "Today was a great day! I had so much fun with my friends.",
        "I'm just doing homework, feeling okay.",
        "Everything is terrible, I don't know what to do anymore.",
        "I am so happy with my progress in school.",
    ]

    print("\nTesting with custom sentences:")
    for sentence in test_sentences:
        cleaned_sentence = preprocess_text(sentence)
        sentence_vec = tfidf_vectorizer.transform([cleaned_sentence])
        predicted_sentiment = sentiment_model.predict(sentence_vec)[0]
        print(f"Sentence: '{sentence}' -> Predicted Sentiment: {predicted_sentiment}")

## 8. Save the Model and Vectorizer
    
To use the trained model in our chatbot application, we need to save it along with the TF-IDF vectorizer. We'll save them in the `chatbot_project/` root directory.
   

In [None]:
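# Look the objects up with globals().get so this cell does not raise
# NameError when the training cells were skipped
model_to_save = globals().get('sentiment_model')
vectorizer_to_save = globals().get('tfidf_vectorizer')

if model_to_save is not None and vectorizer_to_save is not None:
    joblib.dump(model_to_save, '../sentiment_model.pkl')
    joblib.dump(vectorizer_to_save, '../tfidf_vectorizer.pkl')
    print("Model and vectorizer saved successfully as sentiment_model.pkl and tfidf_vectorizer.pkl.")
else:
    print("Model or vectorizer not trained/available, skipping save.")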
   

## 9. Conclusion
You have now trained and saved a sentiment analysis model. The `sentiment_analysis.py` file in the main project directory loads these saved files to predict sentiment, which lets the chatbot tailor its responses to the user's input.
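The contents of `sentiment_analysis.py` are not shown in this notebook, so the following is only a hypothetical sketch of how the saved files could be loaded and used. The function names, default paths, and the `preprocess` parameter are assumptions for illustration.

```python
def load_artifacts(model_path="sentiment_model.pkl", vectorizer_path="tfidf_vectorizer.pkl"):
    """Load the saved model and vectorizer (paths are assumptions; adjust to your layout)."""
    import joblib  # imported here so predict_sentiment can be used without joblib installed

    return joblib.load(model_path), joblib.load(vectorizer_path)


def predict_sentiment(text, model, vectorizer, preprocess):
    """Clean the text the same way as in training, then vectorize and classify it."""
    cleaned = preprocess(text)
    features = vectorizer.transform([cleaned])
    return model.predict(features)[0]
```

The `preprocess` argument should be the same `preprocess_text` function used during training, since the vectorizer's vocabulary was built from cleaned text.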