This notebook performs sentiment analysis and topic modeling on a dataset of COVID-19 related tweets. It uses the TextBlob library for sentiment analysis and the Latent Dirichlet Allocation (LDA) algorithm to uncover latent topics within the tweet text. Additionally, a classifier is trained to categorize tweets into three sentiment categories: positive, negative, and neutral.
Before running the script, ensure you have installed the necessary libraries. You can install all the dependencies with the following command:
```bash
pip install pandas matplotlib seaborn textblob scikit-learn wordcloud tqdm
```
Clone the repository:
- First, clone this repository to your local environment or Jupyter Notebook.
Obtain the dataset:
- Ensure the CSV file containing the tweets is located at the path specified in the notebook. The dataset should include a column with the tweet text.
Run the notebook:
- Open the notebook in Jupyter Notebook and execute the cells sequentially to perform the analysis.
- The notebook starts by loading the Twitter data from the specified CSV file using the Pandas library.
- Sentiment analysis is then performed on each tweet using TextBlob, and the sentiment polarity scores are added to the dataframe (a sketch of these loading and scoring steps appears after this list).
- Topic modeling is performed using the Latent Dirichlet Allocation (LDA) algorithm, a popular technique for discovering hidden topics in large text datasets.
- The notebook defines the function get_lda_topics() to preprocess the text, create a document-term matrix, and fit the LDA model (see the sketch after this list).
- Functions such as plot_lda_topics() and plot_wordclouds() are used to visualize the topics generated by the LDA model (illustrative sketches follow this list).
- These visualizations include displaying the most frequent words for each topic and creating word clouds to graphically represent the topics.
- The distribution of sentiment polarity scores is visualized using a histogram, giving a clear view of how sentiments are distributed across the tweets (see the example after this list).
- A Naive Bayes classifier is trained to classify tweets into three sentiment categories: positive, negative, and neutral.
- The model is trained using the labeled dataset and the sentiment polarity scores extracted earlier.
- The performance of the classifier is evaluated using metrics such as precision, recall, F1-score, and accuracy.
- A confusion matrix and classification report are generated to assess the classifier's effectiveness and identify which sentiment categories are more difficult to classify correctly (a sketch of the training and evaluation steps follows this list).
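The following sketch illustrates the loading and sentiment-scoring steps described above. The file name covid19_tweets.csv and the column name text are assumptions; substitute the path and column used in your copy of the dataset.

```python
import pandas as pd
from textblob import TextBlob

# Load the tweets (file name and text column are placeholders).
df = pd.read_csv("covid19_tweets.csv")

# TextBlob polarity ranges from -1 (most negative) to 1 (most positive).
df["polarity"] = df["text"].astype(str).apply(lambda t: TextBlob(t).sentiment.polarity)

# Map the continuous polarity score to the three sentiment categories.
def label_sentiment(p):
    if p > 0:
        return "positive"
    if p < 0:
        return "negative"
    return "neutral"

df["sentiment"] = df["polarity"].apply(label_sentiment)
```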
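Below is a minimal sketch of what the get_lda_topics() helper could look like, built on scikit-learn's CountVectorizer and LatentDirichletAllocation; the exact preprocessing, number of topics, and return values in the notebook may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def get_lda_topics(texts, n_topics=5, n_words=10):
    """Fit an LDA model on raw texts and return it with the top words per topic."""
    # Build the document-term matrix with English stop-word removal.
    vectorizer = CountVectorizer(stop_words="english", max_df=0.95, min_df=2)
    dtm = vectorizer.fit_transform(texts)

    # Fit the LDA model on the document-term matrix.
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
    lda.fit(dtm)

    # Collect the most probable words for each topic.
    vocab = vectorizer.get_feature_names_out()
    topics = [[vocab[i] for i in topic.argsort()[::-1][:n_words]] for topic in lda.components_]
    return lda, vectorizer, topics
```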
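The plotting helpers could be implemented roughly as below; the function names mirror those mentioned above, but the bodies are only an illustrative sketch using matplotlib and the wordcloud package.

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def plot_lda_topics(lda, vectorizer, n_words=10):
    """Horizontal bar chart of the most frequent words for each topic."""
    vocab = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(lda.components_):
        top = topic.argsort()[::-1][:n_words]
        plt.figure(figsize=(6, 3))
        plt.barh([vocab[i] for i in top][::-1], topic[top][::-1])
        plt.title(f"Topic {idx}")
        plt.tight_layout()
        plt.show()

def plot_wordclouds(lda, vectorizer, n_words=50):
    """One word cloud per topic, weighted by the topic-word weights."""
    vocab = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(lda.components_):
        freqs = {vocab[i]: topic[i] for i in topic.argsort()[::-1][:n_words]}
        cloud = WordCloud(background_color="white").generate_from_frequencies(freqs)
        plt.figure(figsize=(5, 3))
        plt.imshow(cloud, interpolation="bilinear")
        plt.axis("off")
        plt.title(f"Topic {idx}")
        plt.show()
```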
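The sentiment distribution histogram is a single seaborn call, assuming the polarity column created in the first sketch:

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.histplot(df["polarity"], bins=30, kde=True)
plt.title("Distribution of tweet sentiment polarity")
plt.xlabel("Polarity")
plt.show()
```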
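Finally, a sketch of how the Naive Bayes classifier might be trained and evaluated. It assumes bag-of-words features over the tweet text and the polarity-derived sentiment labels from the first sketch; the notebook's actual feature set may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Bag-of-words features; labels are the polarity-based sentiment categories.
X = CountVectorizer(stop_words="english").fit_transform(df["text"].astype(str))
y = df["sentiment"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = MultinomialNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```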
When running the notebook, you will obtain the following results:
- Topic Visualizations: Images displaying the topics generated by the LDA model and the most common words associated with each topic. This helps in understanding the main issues discussed in the COVID-19 tweets.
- Sentiment Distribution: Graphs showing the distribution of sentiments (positive, negative, and neutral) in the tweets.
- Classifier Performance: Performance metrics such as model accuracy, the confusion matrix, and the classification report, providing detailed insights into the classifier's effectiveness.