# Weather Sentiment Analysis Project

Welcome to this beginner-friendly NLP project! We will build a text classifier to understand how people feel about the weather.

## What we will learn:
1. **Text Preprocessing**: Cleaning up raw text data.
2. **Vectorization (TF-IDF)**: Converting text into numbers the computer can understand.
3. **Model Training**: Teaching a Machine Learning model to recognize patterns.
4. **Prediction**: Testing our model on new sentences.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the dataset; stop here if the file is missing, since every later cell needs `df`
try:
    df = pd.read_csv('data/weather_sentiment_samples.csv')
    print("✅ Data loaded successfully!")
    print(df.head())
except FileNotFoundError:
    print("⚠️ Data file not found. Create it first or update the path.")
    raise

## Step 1: Explore the Data
Let's look at how many samples we have for each sentiment (positive, negative, neutral).

In [None]:
print(df['label'].value_counts())

sns.countplot(x='label', data=df)
plt.title('Distribution of Sentiments')
plt.show()

## Step 2: Prepare Data for Training
We split the data into a training set (80%, used for learning) and a test set (20%, held out for evaluation).
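
One optional refinement is a stratified split, which keeps the label proportions the same in both sets. The sketch below uses made-up toy labels (not the project's dataset) and the `stratify` argument of `train_test_split`. Whether it helps here depends on how balanced the real labels are.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples with a 10/5/5 label mix
toy_texts = [f"sample {i}" for i in range(20)]
toy_labels = ["positive"] * 10 + ["negative"] * 5 + ["neutral"] * 5

# stratify=toy_labels asks for the same label mix in train and test
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

train_counts = Counter(toy_y_train)
test_counts = Counter(toy_y_test)
print("Train label counts:", dict(train_counts))
print("Test label counts:", dict(test_counts))
```

Note that stratification raises an error when a class has only one sample, so it assumes every label appears at least twice.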

In [None]:
X = df['text']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")

## Step 3: Vectorization (TF-IDF)
Computers work with numbers, not words. We use **TF-IDF (Term Frequency-Inverse Document Frequency)** to convert text into numerical vectors.
- **TF**: How often a word appears in a document.
- **IDF**: How rare a word is across all documents.
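
To make the two ingredients concrete, here is a small hand calculation in plain Python. It assumes the smoothed IDF formula and L2 normalization that `TfidfVectorizer` applies with its default settings. The three-document corpus and the helper names are invented for illustration.

```python
import math

# Toy corpus (hypothetical): each document is already split into lowercase words
docs = [
    ["sunny", "warm", "day"],
    ["cold", "rainy", "day"],
    ["sunny", "day"],
]
n_docs = len(docs)

def smoothed_idf(term):
    """Smoothed IDF: ln((1 + n_docs) / (1 + doc_freq)) + 1."""
    doc_freq = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf_vector(doc):
    """Raw term count times IDF, then scaled to unit (L2) length."""
    weights = {term: doc.count(term) * smoothed_idf(term) for term in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

# "day" appears in every document, so it gets the lowest IDF
for term in ["day", "sunny", "warm"]:
    print(f"idf({term}) = {smoothed_idf(term):.3f}")

print(tfidf_vector(docs[0]))
```

In the first document each word occurs once, so the rarest word ("warm") ends up with the largest weight and the most common word ("day") with the smallest.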

In [None]:
vectorizer = TfidfVectorizer(stop_words='english')

# Fit (learn vocabulary) and transform (convert to numbers) the training data
X_train_tfidf = vectorizer.fit_transform(X_train)

# Transform the test data (using the vocabulary learned from training)
X_test_tfidf = vectorizer.transform(X_test)

print("Shape of training matrix:", X_train_tfidf.shape)

## Step 4: Train the Model
We'll use **Multinomial Naive Bayes**, a simple yet effective algorithm for text classification that works directly on word-weight features such as TF-IDF.
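
For intuition about what the classifier learns, the sketch below trains `MultinomialNB` on four invented sentences. It swaps in `CountVectorizer` (plain word counts) instead of TF-IDF so the arithmetic is easy to follow by hand. The sentences and labels are assumptions made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set: two positive and two negative sentences
toy_texts = [
    "sunny warm lovely",
    "lovely sunny day",
    "cold rainy miserable",
    "miserable cold day",
]
toy_labels = ["positive", "positive", "negative", "negative"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_texts)

toy_model = MultinomialNB()  # default alpha=1.0 adds one pseudo-count per word
toy_model.fit(toy_counts, toy_labels)

# "afternoon" was never seen in training, so only "warm" and "sunny" count
query = toy_vectorizer.transform(["warm sunny afternoon"])
toy_prediction = toy_model.predict(query)[0]
toy_proba = dict(zip(toy_model.classes_, toy_model.predict_proba(query)[0]))

print(toy_prediction)
print(toy_proba)
```

The model multiplies the per-class probabilities of each word it recognizes. "warm" and "sunny" were seen only in positive sentences, so the positive class wins.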

In [None]:
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)
print("✅ Model trained!")

## Step 5: Evaluate the Model
Let's see how well the model performs on the held-out test data.

In [None]:
y_pred = model.predict(X_test_tfidf)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))
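
Accuracy alone hides which classes get confused with each other. A confusion matrix shows this, and a minimal sketch with `confusion_matrix` follows. The true and predicted labels below are invented for illustration and are not results from this project.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, made up to show how the matrix is read
toy_true = ["positive", "positive", "negative", "neutral", "neutral", "negative"]
toy_pred = ["positive", "neutral", "negative", "neutral", "positive", "negative"]

# Fix the label order so rows and columns are easy to interpret
label_order = ["negative", "neutral", "positive"]
cm = confusion_matrix(toy_true, toy_pred, labels=label_order)

# Rows are true labels, columns are predicted labels
print(label_order)
print(cm)
```

Each row sums to the number of true samples of that class. Off-diagonal entries are the mistakes, here one neutral sample predicted as positive and one positive sample predicted as neutral.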

## Step 6: Test on New Sentences
Now for the fun part! Let's try the model on a few new weather descriptions. Feel free to swap in your own.

In [None]:
new_sentences = [
    "I absolutely love this warm sunshine!",
    "The thunder is scary and I hate it.",
    "It is partly cloudy today.",
    "Freezing cold weather makes me miserable."
]

new_vectors = vectorizer.transform(new_sentences)
predictions = model.predict(new_vectors)

for sentence, sentiment in zip(new_sentences, predictions):
    print(f"Sentence: '{sentence}' -> Sentiment: {sentiment}")
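
One thing to keep in mind when trying new sentences: the vectorizer only knows words it saw during `fit`. The sketch below, using a tiny invented vocabulary, shows that a sentence made entirely of unseen words becomes an all-zero vector.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-sentence training corpus
toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_vectorizer.fit(["warm sunny day", "cold rainy night"])

known = toy_vectorizer.transform(["sunny night"])
unknown = toy_vectorizer.transform(["blizzard tomorrow"])

# nnz is the number of non-zero entries in the sparse row
print("Known words, non-zero entries:", known.nnz)
print("Unseen words, non-zero entries:", unknown.nnz)
```

For an all-zero row, `MultinomialNB` has no word evidence and falls back on the class priors learned from the training labels, so the prediction says little about the sentence itself.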