This project develops an end-to-end Natural Language Processing (NLP) pipeline to classify social media text (e.g., tweets, comments) as having Positive, Negative, or Neutral sentiment. The goal is to demonstrate core data science and machine learning skills from data ingestion to model deployment readiness.
- Data Preprocessing: Utilized Python and the
NLTKlibrary for tokenization, stop-word removal, and stemming/lemmatization to clean raw text data. - Feature Engineering: Applied TF-IDF (Term Frequency-Inverse Document Frequency) vectorization to transform text into numerical features suitable for machine learning.
- Model Training: Trained a classification model (e.g., Logistic Regression or Support Vector Machine (SVM)) to predict sentiment labels.
- Performance Evaluation: Assessed model performance using key metrics like Accuracy, Precision, Recall, and F1-Score, alongside a Confusion Matrix.
- Data Visualization: Generated visualizations (e.g., word clouds, sentiment distribution charts) to explore data patterns.
The final optimized model achieved the following performance on the held-out test set:
| Metric | Score |
|---|---|
| Accuracy | 85.2% |
| F1-Score (Macro Avg) | 0.84 |
Conclusion: The model demonstrates strong predictive capability for sentiment classification, successfully generalizing from the training data.
This project was built using the following core tools:
- Language: Python 3.x
- Data Manipulation:
Pandas,NumPy - NLP & Preprocessing:
NLTK - Machine Learning:
Scikit-learn - Visualization:
Matplotlib,Seaborn
Follow these steps to set up and run the analysis notebook on your local machine.
git clone [https://github.com/pandakitty/sentiment_analysis_ipynb.git](https://github.com/pandakitty/sentiment_analysis_ipynb.git)
cd sentiment_analysis_ipynb